Contents
1. [Figure 4.4: Network Settings — cmgui screenshot showing the settings form for the internalnet management network: domain name, base address 10.141.0.0, netmask bits 16, MTU 1500, "Allow node booting" ticked, and a dynamic range of 10.141.128.0 to 10.141.143.255]

   Property           Description
   -----------------  --------------------------------------------------------
   Name               Name of the network
   Domain name        DNS domain associated with the network
   External network   Switch to treat the network as an external network
   Base address       Base address of the network (also known as the network
                      address)
   Netmask bits       Prefix length, or number of bits in the netmask (the
                      part after the "/" in CIDR notation)

   Figure 4.5: Network properties

   In basic networking concepts, a network is a range of IP addresses. The first address in the range is the base address. The length of the range, i.e. the subnet, is determined by the netmask.
2. [Figure 7.2: cmgui Certificates Tab]

   After clicking on the Add button of the Certificates tab, a dialog comes up in which the certificate is set up and a profile selected (figure 7.3).

   Clicking on the Add button in figure 7.3 saves the certificate and generates a .pfx. Another dialog then opens up to prompt the user for the path to where the key is to be saved. A password to protect the key with is also asked for (figure 7.4).

   Users that use this certificate for their cmgui clients are then restricted to the set of tasks allowed by their profile, and carry out the tasks with the privileges of the specified system login name (peter in figure 7.3).

   [Figure 7.3: cmgui Add Certificate And Profile Dialog — fields shown: Name (democert), Organization, Organizational Unit, Location, State, Country, Key length (1024), Valid until (04/Apr/2012), Profile (readonly), System login (peter)]

   [Figure 7.4: cmgui Password-protect Key And Save — prompts for a filename (e.g. /home/peter/peterfile.pfx) and a password]

   Workload Management

   For clusters that have many users and a significant load, a workload management system allows a more efficient use of resources to be enforced for all users than if there were no such system.
3. The health check ldap checks if the ldap service is running. It tests the ability to look up a user on the LDAP server, using cmsupport as the default user. If a value is specified for the parameter, it uses that value as the user instead.

   The health check portchecker takes parameter values such as "192.168.0.1 22" to check if host 192.168.0.1 has port 22 open.

   • Log length: The maximum number of samples that are stored for the health check. 3000 by default.
   • Sampling interval: The time between samples. 120s by default.
   • Prejob: Clicking on this button sets the health check to run before a new job is run from the scheduler of the workload management system, instead of running at regular intervals.
   • Gap size: The number of missing samples allowed before a null value is stored as a sample value. 2 by default.
   • Threshold duration: Number of samples in the threshold zone before a health check state is decided to have changed. 1 by default.
   • Fail severity: The severity value assigned to a FAIL response for a health check. 10 by default.
   • Unknown severity: The severity value assigned to an UNKNOWN response for a health check. 10 by default.
   • Options checkboxes:
     – Store: If ticked, the health check state data values are saved to the database. Note that health state changes and actions still take place even if no values are stored.
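The behavior of a check like portchecker can be sketched in plain bash. This is a hypothetical standalone script, not the actual Bright implementation: it takes a host and port, attempts a TCP connect within a timeout, and reports PASS or FAIL the way a health check response would.

```shell
#!/bin/bash
# Hypothetical sketch of a portchecker-style health check
# (not the actual Bright script): PASS if a TCP connect to
# <host> <port> succeeds within a timeout, FAIL otherwise.
portcheck() {
    local host=$1 port=$2
    # /dev/tcp/<host>/<port> is a bash pseudo-device; "timeout" bounds
    # the connection attempt so a filtered port cannot hang the check.
    if timeout 3 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
        echo PASS
    else
        echo FAIL
    fi
}

portcheck 127.0.0.1 22   # PASS if sshd is listening locally, FAIL otherwise
```

The real portchecker receives its host and port as the health check parameter, as described above.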
4. I.5 Environment Variables
   I.6 Metric Collections Examples
   J   Changing The Network Parameters Of The Head Node
   J.1 Introduction
   J.2 Method
   J.3 Terminology

   Preface

   Welcome to the Administrator Manual for the Bright Cluster Manager 5.1 cluster environment.

   0.1 Quickstart
   For readers who want to get a cluster up and running as quickly as possible with Bright Cluster Manager, Appendix F is a quickstart installation guide.

   0.2 About This Manual
   The rest of this manual is aimed at helping system administrators install, understand and manage a cluster running Bright Cluster Manager, so as to get the best out of it.

   The Administrator Manual covers administration topics which are specific to the Bright Cluster Manager environment. Readers should already be familiar with basic Linux system administration, which the manual does not generally cover. Aspects of system administration that require a more advanced understanding of Linux concepts for clusters are explained appropriately. This manual is not intended for users interested only in interacting with the cluster.
5. FrozenFile directive
   Syntax: FrozenFile = { <filename> }
   Syntax: FrozenFile = { <filename1>, <filename2> }
   Example: FrozenFile = { "/etc/dhcpd.conf", "/etc/postfix/main.cf" }

   The FrozenFile directive is used to prevent files from being automatically generated. This is useful when site-specific modifications to configuration files have to be made.

   SyslogHost directive
   Syntax: SyslogHost = <hostname>
   Default: SyslogHost = "localhost"

   The SyslogHost directive specifies the hostname of the syslog host.

   SyslogFacility directive
   Syntax: SyslogFacility = <facility>
   Default: SyslogFacility = "LOG_LOCAL6"

   The value of <facility> must be LOG_KERN, LOG_USER, LOG_MAIL, LOG_DAEMON, LOG_AUTH, LOG_SYSLOG or LOG_LOCAL0..7.

   Disk Partitioning

   Bright Cluster Manager requires that disk partitionings are specified using the XML format that is described below. Partitioning is relevant when the disk layout for nodes is being configured, but also when the head node is initially installed. For nodes, the XML format also allows diskless operation.

   D.1 Structure of Partitioning Definition
   The global structure of a file that describes a partitioning setup is defined using an XML schema. The schema file is installed on the head node in /cm/node-installer/scripts/disks.xsd. This section shows the schema; the next sections contain a few examples with an explanation of all elements.

   <?xml version="1.0"
6. 5.1 Configuring Power Parameters

   [Figure 5.2: PDU Overview — cmgui screenshot showing the apc01 PDU: state Up, model AP7920, and its port assignment to devices such as node001, node002 and mycluster]

   mycluster> device node001 get powerdistributionunits
   apc01:6 apc01:7 apc01:8
   mycluster> device node001 removefrom powerdistributionunits apc01:7
   mycluster> device node001 get powerdistributionunits
   apc01:6 apc01:8
   mycluster> device node001 set powercontrol apc
   mycluster> device node001 get powercontrol
   apc
   mycluster> device node001 commit

   5.1.3 Combining PDU- And IPMI-Based Power Control
   By default, when nodes are configured for IPMI-based power control, any configured PDU ports are ignored. However, it is sometimes useful to change this behavior. For example, in the CMDaemon configuration file directives in /cm/local
7. For the associated compute nodes, the execution log exists in /cm/shared/apps/sge/current/default/spool/node<number>/messages, where node<number> is the node name, for example: node001, node002.

   8.4.2 Torque Installation, Initialization And Configuration
   Torque is a resource manager, controlling the jobs and compute nodes it talks with. Torque has its own built-in scheduler, but since this is quite basic, the open source Maui and the proprietary Moab schedulers are recommended alternatives.

   Installing Torque
   The Torque package is installed, but not set up, by default on Bright Cluster Manager 5.1. If it is not set up during installation (figure 2.17), then when it is set up later it must be initialized with the following script, using the -q flag:

   /cm/shared/apps/torque/current/cm/cm-install-torqueserver -q

   The execution daemon, pbs_mom, is already in the node images by default, and does not need to be installed, even if Maui or Moab are added. The Torque services can be enabled via role assignment, as described in section 8.3.

   Torque software components are installed in /cm/shared/apps/torque/current, also referred to as the PBS_HOME. The torque environment module, which sets PBS_HOME and other environment variables, must be loaded in order to submit jobs to Torque. Torque documentation is available at the Adaptive Computing web site at http://ww
8. The SnmpSessionTimeout specifies the time-out for SNMP calls, in microseconds.

   PowerOffPDUOutlet directive
   Syntax: PowerOffPDUOutlet = true|false
   Default: PowerOffPDUOutlet = false

   On clusters with both PDU and IPMI power control, the PowerOffPDUOutlet directive allows, when enabled, PDU ports to be powered off as well, to conserve power. See section 5.1.3 for more information.

   MetricAutoDiscover directive
   Syntax: MetricAutoDiscover = true|false
   Default: MetricAutoDiscover = true

   Scan for new hardware components which are not monitored yet, and schedule them for monitoring.

   UseHWTags directive
   Syntax: UseHWTags = true|false
   Default: UseHWTags = false

   When UseHWTags is set to true, the boot procedure for unknown nodes requires the administrator to enter a HWTag on the console.

   DisableBootLogo directive
   Syntax: DisableBootLogo = true|false
   Default: DisableBootLogo = false

   When DisableBootLogo is set to true, the Bright Cluster Manager logo is not displayed on the first boot menu.

   StoreBIOSTimeInUTC directive
   Syntax: StoreBIOSTimeInUTC = true|false
   Default: StoreBIOSTimeInUTC = false

   When StoreBIOSTimeInUTC is set to true, the BIOS time in nodes is stored in UTC, rather than local time.

   FreezeChangesToSGEConfig directive
   Syntax: FreezeChangesToSGEConfig = true|false
   Default: FreezeChangesToSGEConfig = false

   When FreezeChanges
9. Default: DBPass = <random string set during installation>

   The DBPass directive specifies the password that will be used to connect to the MySQL database server.

   DBName directive
   Syntax: DBName = <database>
   Default: DBName = cmdaemon

   The DBName directive specifies the database that will be used on the MySQL database server to store CMDaemon-related configuration and status information.

   DBMonName directive
   Syntax: DBMonName = <database>
   Default: DBMonName = cmdaemon_mon

   The DBMonName directive specifies the database that will be used on the MySQL database server to store monitoring-related data.

   DBUnixSocket directive
   Syntax: DBUnixSocket = <filename>

   The DBUnixSocket directive specifies the named pipe that will be used to connect to the MySQL database server, if it is running on the same machine.

   DBUpdateFile directive
   Syntax: DBUpdateFile = <filename>
   Default: DBUpdateFile = /cm/local/apps/cmd/etc/cmdaemon_upgrade.sql

   The DBUpdateFile directive specifies the path to the file that contains information on how to upgrade the database from one revision to another.

   EventBucket directive
   Syntax: EventBucket = <filename>
   Default: EventBucket = /var/spool/cmd/eventbucket

   The EventBucket directive specifies the path to the named pipe that will be created to listen for incoming events.

   EventBucketFilter directive
   Syntax: EventBucketFilter = <filename>
   D
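The event bucket is an ordinary named pipe, so the underlying mechanism can be illustrated with a throwaway FIFO. The paths and the message below are invented for the illustration; the real bucket is whatever path the EventBucket directive points at, and CMDaemon is the real reader.

```shell
# Illustration of the named-pipe mechanism behind the event bucket.
# The FIFO path is a throwaway, not the real /var/spool/cmd/eventbucket.
fifo=$(mktemp -u /tmp/bucket.XXXXXX)
out=$(mktemp /tmp/received.XXXXXX)
mkfifo "$fifo"

# A reader must be draining the pipe (as CMDaemon drains its bucket);
# otherwise the writer would block on open.
cat "$fifo" > "$out" &
echo "node001 is misbehaving" > "$fifo"
wait                      # let the background reader finish

cat "$out"                # prints: node001 is misbehaving
rm -f "$fifo" "$out"
```

Writing a line into the configured pipe is how events enter the listener; the pipe itself stores nothing once the reader has drained it.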
10. [mc]->softwareimage[default-image]->kernelmodules: e1000

    Note that, after committing the change, it can take some time (typically a minute) before the ramdisk creation is done.

    6.7.5 Node-Installer Cannot Create Disk Layout
    When the node-installer is not able to create a drive layout, it displays a message similar to figure 6.25. The node-installer log file contains something like:

    Mar 24 13:55:31 10.141.0.1 node-installer: Installmode is: AUTO
    Mar 24 13:55:31 10.141.0.1 node-installer: Fetching disks setup.
    Mar 24 13:55:31 10.141.0.1 node-installer: Checking partitions and filesystems.
    Mar 24 13:55:32 10.141.0.1 node-installer: Detecting device /dev/sda: not found
    Mar 24 13:55:32 10.141.0.1 node-installer: Detecting device /dev/hda: not found
    Mar 24 13:55:32 10.141.0.1 node-installer: Can not find device(s) (/dev/sda /dev/hda).
    Mar 24 13:55:32 10.141.0.1 node-installer: Partitions and/or filesystems are missing/corrupt. (Exit code 4, signal 0)
    Mar 24 13:55:32 10.141.0.1 node-installer: Creating new disk layout.
    Mar 24 13:55:32 10.141.0.1 node-installer: Detecting device /dev/sda: not found
    Mar 24 13:55:32 10.141.0.1 node-installer: Detecting device /dev/hda: not found
    Mar 24 13:55:32 10.141.0.1 node-installer: Can not find device(s) (/dev/sda /dev/hda).
11. the value cannot be set, and the command is simply the name of the metric.

    • Cumulative: If set to yes, then the value is cumulative (for example, the bytes-received counter for an ethernet interface). If set to no (default), then the value is not cumulative (for example, temperature).
    • Description: Description of the metric. Empty by default.
    • Disabled: If set to no (default), then the script runs.
    • Extended environment: If set to yes, more information about the device is made part of the environment to the script. The default is no.
    • Measurement Unit: A unit for the metric. A percent is indicated with %.
    • Name: The name given to the metric.
    • Only when idle: If set to yes, the metric script runs only when the system is idling. Useful if the metric is resource-hungry, in order to burden the system less. It is set to no by default.
    • Parameter permissions: Decides if parameters passed to the metric script can be used. The three possible values are:
      – disallowed: parameters are not used
      – required: parameters are mandatory
      – optional (default): parameters are optional
    • Retrieval method:
      – cmdaemon (default): metrics retrieved internally using CMDaemon
      – snmp: metrics retrieved internally using SNMP
    • Sampling method:
      – samplingonmaster: The head node samples the metric on behalf of a device. For example, the head node may do this for a PDU, because the PDU does not
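The cumulative setting matters when interpreting samples: a cumulative metric must be differenced between two samples to give a rate, while a non-cumulative gauge (such as a temperature) is read directly. A toy calculation, with made-up counter values:

```shell
# Toy conversion of a cumulative counter into a per-second rate.
# The two samples are invented values for a bytes-received counter,
# taken 120 s apart (the default sampling interval mentioned earlier).
prev=1500000        # counter value at time t0
curr=1620000        # counter value at t0 + 120 s
interval=120        # seconds between samples

rate=$(( (curr - prev) / interval ))
echo "${rate} bytes/s"      # (1620000 - 1500000) / 120 = 1000 bytes/s
```

A monitoring front-end does this differencing automatically for metrics flagged as cumulative; the sketch only shows the arithmetic involved.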
12. 10.7 Monitoring Modes With cmsh

    • monitoring actions
    • monitoring healthchecks
    • monitoring metrics
    • monitoring setup

    The word "monitoring" is therefore merely a grouping label, prefixed inseparably to these 4 modes. The syntax of the 4 bulleted commands above is thus consistent with that of the other top-level cmsh modes.

    Sections 10.7.1, 10.7.2, 10.7.3 and 10.7.4 give examples of how objects are handled under these 4 monitoring modes. To avoid repeating similar descriptions, section 10.7.1 is relatively detailed, and is often referred to by the other sections.

    10.7.1 cmsh Monitoring Actions
    The monitoring actions mode of cmsh corresponds to the cmgui actions tab of section 10.4.6.

    The monitoring actions mode handles actions objects in the way described in the introduction to working with objects (section 3.6.3). A typical reason to handle action objects (the properties associated with an action script or action built-in) might be to view the actions available, or to add a custom action for use by, for example, a metric or health check.

    This section continues the cmsh session started above, giving examples of how the monitoring actions mode is used.

    list, show And get
    The list command by default lists the names and command scripts available in monitoring actions mode.

    Example:
    [myheadnode]% monitoring actions
    [myheadnode->monitoring->actions]% list
    Name (key)       Co
13. Day-to-day Administration

    sandbox.brightcomputing.com
    Allow a Bright Computing engineer ssh access to the cluster? [Y/n]:
    Enter additional information for Bright Computing (eg: related
    ticket number, problem description)? [Y/n]:
    End input with ctrl-d
    Ticket 1337: the florbish is grommicking
    Thank you.
    Added temporary Bright public key.

    The screen clears, and the tunnel opens up, displaying the following notice:

    REMOTE ASSISTANCE REQUEST
    ########################################################
    A connection has been opened to Bright Computing Support.
    Closing this window will terminate the remote assistance
    session.
    --------------------------------------------------------
    Hostname: bright51.NOFQDN
    Connected on port: 7000
    ctrl-c to terminate this session

    Bright Computing support automatically receives an e-mail alert that an engineer can now tunnel into the cluster. When the engineer has ended the session, the administrator may remove the tunnel with a ctrl-c, and the display then shows:

    Tunnel to sandbox.brightcomputing.com terminated.
    Removed temporary Bright public key.
    [root@bright51 ~]#

    11.4 Backups
    Bright Cluster Manager does not include facilities to create backups of a cluster installation. When setting up a backup mechanism, it is recommended that the full file system of the head node (i.e. including all node images) is backed up. Unless the node hard drives are used to store important data, it is not necessary to back up nodes.

    If no bac
14. Bright Cluster Manager 5.1
    Administrator Manual
    Revision: 6775
    Date: Fri, 27 Nov 2015

    ©2011 Bright Computing, Inc. All Rights Reserved. This manual or parts thereof may not be reproduced in any form unless permitted by contract or by written permission of Bright Computing, Inc.

    Trademarks
    Linux is a registered trademark of Linus Torvalds. PathScale is a registered trademark of Cray, Inc. Red Hat and all Red Hat-based trademarks are trademarks or registered trademarks of Red Hat, Inc. SuSE is a registered trademark of Novell, Inc. PGI is a registered trademark of The Portland Group Compiler Technology, STMicroelectronics, Inc. SGE is a trademark of Sun Microsystems, Inc. FLEXlm is a registered trademark of Globetrotter Software, Inc. Maui Cluster Scheduler is a trademark of Adaptive Computing, Inc. ScaleMP is a registered trademark of ScaleMP, Inc. All other trademarks are the property of their respective owners.

    Rights and Restrictions
    All statements, specifications, recommendations, and technical information contained herein are current or planned as of the date of publication of this document. They are reliable as of the time of this writing and are presented without warranty of any kind, expressed or implied. Bright Computing, Inc. shall not be liable for technical or editorial errors or omissions which may occur in this document. Bright Computing, Inc. shall not be liable for any damages resulting from
15. the ramdisk, and rebooting the node to get the interfaces up again, may be necessary.

    • initializing IPMI interfaces: IPMI interfaces, if present and set up in the node's configuration, are also initialized with correct IP address, netmask and user/password settings.

    • restarting the network interfaces: At the end of this step (i.e. section 6.3.3) the network interfaces are up. When the node-installer has completed the remainder of its 13 steps (sections 6.3.4 to 6.3.13), control is handed over to the local init process running on the local drive. During this handover, the node-installer brings down all network devices. These are then brought back up again by init, by the distribution's standard networking init scripts, which run from the local drive and expect networking devices to be down to begin with.

    6.3.4 Determining Install-mode Type And Execution Mode
    Stored install-mode values decide whether synchronization is to be applied fully to the local drive of the node, only for some parts of its filesystem, not at all, or even whether to drop into a maintenance mode instead.

    Related to install-mode values are execution-mode values, that determine whether to apply the install-mode values to the next boot, to new nodes only, to individual nodes, or to a category of nodes. These values are merely determined at this stage; nothing is executed yet.

    install-mode values: The install mode can ha
16. Table H.1.1: List Of Metrics (continued)

    Name                Description
    ------------------  ----------------------------------------------------
    CMDMemUsed          Resident memory used by CMDaemon
    CMDState            State in which CMDaemon is running (head 0, node 1,
                        failover 2)
    CMDSystime          Time spent by CMDaemon in system mode
    CMDUsertime         Time spent by CMDaemon in user mode
    CPUCoresAvailable   Cluster-wide number of CPU cores
    CPUIdle             Total core usage in idle tasks, per second
    CPUIrq              Total core usage in servicing interrupts, per second
    CPUNice             Total core usage in nice'd user mode, per second
    CPUSoftIrq          Total core usage in servicing soft interrupts, per
                        second
    CPUSystem           Total core usage in system mode, per second
    CPUUser             Total core usage in user mode, per second
    CPUWait             Total core usage in waiting for I/O to complete, per
                        second
    CacheMemory         System memory used for caching
    CompletedJobs       Jobs completed
    CtxtSwitches        Number of context switches, per second
    DevicesUp           Number of devices in status UP
    DropRecv            Number of received packets which are dropped
    DropSent            Number of packets sent which are dropped
    ErrorsRecv          Number of received packets with error
    ErrorsSent          Number of packets sent which have error
    EstimatedDelay      Estimated delay to execute jobs
    FailedJobs          Failed jobs
    Forks               Number of forks since boot, per second
    FreeSpace           Free space for non-root, on a mount point
    GPUAvailable        Cluster-wide number of GPUs
    IOInProgress
17. • Disabled: If ticked, the health state script does not run, and no health check state changes or actions associated with it occur. If Store is ticked, the value it stores while Disabled is ticked for this health check configuration is an UNKNOWN value.

    • Only when idle: If ticked, the health check script is only run when the system is idling. This burdens a system less, and is useful if the health check is resource-hungry.

    • Pass action, Fail action, Unknown action, State Flapping: These are all action launchers, which launch an action for a given health state (PASS, FAIL, UNKNOWN), or for a flapping state, depending on whether these states are true or false. Each action launcher is associated with three input boxes:
      – The first selection box decides what action to launch if the state is true.
      – The next box is a plain text entry box that allows a parameter to be passed to the action.
      – The third box is a selection box again, which decides when to launch the action, depending on which of the following conditions is met:
        * Enter: if the state has just started being true. That is, the current sample is in that state, and the previous sample was not in that state.
        * During: if the state is true and ongoing. That is, the current and previous state sample are both in the same state.
        * Leave: if the state has just stopped being true. That is, the current sample is not in that state, and the previous sample was in that state.
18. regardless of scheduler type, with the list command.

    Example:
    [bright51->jobs]% list
    Type       Job ID        User      Queue    Status
    ---------- ------------- --------- -------- -------
    SGEJob     620           maud      all.q    r
    SGEJob     621           maud               qw
    TorqueJob  90.bright51   maud      hydroq   R

    Also within the jobs mode, the hold, release, suspend, resume, show and remove commands act on jobs when used with a specified scheduler type and job ID. Continuing with the example:

    [bright51->jobs]% suspend torque 90.bright51.cm.cluster
    Success
    [bright51->jobs]% list
    Type       Job ID        User      Queue    Status
    ---------- ------------- --------- -------- -------
    SGEJob     620           maud      all.q    r
    SGEJob     621           maud               qw
    TorqueJob  90.bright51   maud      hydroq   S

    While at the jobs mode top level, the suspended job here can be made to resume using suspend's complementary command, resume. However, resume, along with the other commands, can also be executed within a scheduler submode, as is shown shortly.

    jobs Mode In cmsh: The scheduler Submode
    Setting the scheduler type sets the scheduler submode, and can be done thus (continuing with the preceding example):

    [bright51->jobs]% scheduler torque
    [bright51->jobs[torque]]%

    The submode restriction can be unset with scheduler. The top-level jobs mode commands executed within the scheduler submode then only apply to jobs running under that scheduler. The list and resume commands, for example, then apply only to jobs running under torque (continuing with the example):

    [bright51->jobs[torque]]% list    # no sge jobs listed now - only torque
    Type       Job ID   U
19. --numeric-ids
    --exclude=/etc/HOSTNAME --exclude=/etc/localtime --exclude=/proc
    --exclude=/lost+found --exclude=/sys --exclude=/root/.ssh
    --exclude=/var/lib/dhcpcd --exclude=/media/floppy --exclude=/etc/motd
    --exclude=/root/.bash_history --exclude=/root/CHANGES
    --exclude=/var/spool/mail
    --exclude=/etc/udev/rules.d/30-net_persistent_names.rules
    --exclude=/rhn --exclude=/etc/sysconfig/rhn/systemid
    --exclude=/etc/sysconfig/rhn/systemid.save --exclude=/var/spool/up2date
    --exclude=/root/mbox --exclude=/var/cache/yum
    --exclude=/etc/cron.daily/rhn-updates
    root@basehost64:/ /cm/images/new-image

    The first step, that of building the base archive, is now done.

    9.5.2 Creating The Software Image With cm-create-image
    The second step, that of creating the image from the base archive, now needs to be done. This uses the cm-create-image utility, which is part of the cluster-tools package.

    USAGE: cm-create-image -x <base-tar> | -b <image> [-d] [-n <name>]

    -x <base-tar>  Path to gzipped base-tar file (*.tar.gz, *.tgz)
    -b <image>     Path to the directory containing base distribution image,
                   or empty directory where base-tar should be extracted to
    -n <name>      Name of software image; by default it will be the base
                   name of
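The effect of an exclude list when building a base archive can be sketched with a throwaway tar archive. The directory names below are invented for the illustration; the real base archive is built over the head node's filesystem with the excludes listed above.

```shell
# Toy demonstration of building a base archive with excludes.
# All paths here are invented; the real archive covers the filesystem.
base=$(mktemp -d)
mkdir -p "$base/etc" "$base/proc" "$base/root"
echo "keep me"  > "$base/etc/passwd"
echo "skip me"  > "$base/proc/cpuinfo"
echo "history"  > "$base/root/.bash_history"

# Excluded directories and files never enter the gzipped tarball.
tar -C "$base" --exclude=./proc --exclude=./root/.bash_history \
    -czf /tmp/base.tar.gz .

tar -tzf /tmp/base.tar.gz    # ./proc and .bash_history are absent
rm -rf "$base" /tmp/base.tar.gz
```

The same principle applies whichever archiver is used: pseudo-filesystems and host-specific state are left out so the resulting image is clean for provisioning.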
20. figure 3.8).

    Usage:
      cmsh [options]
            Connect to localhost using default port
      cmsh [options] <--certificate|-i certfile> <--key|-k keyfile> <host[:port]>
            Connect to a cluster using certificate and key in PEM format
      cmsh [options] <--certificate|-i certfile> [--password|-p password] <uri[:port]>
            Connect to a cluster using certificate in PFX format

    Valid options:
      --help|-h ................. Display this help
      --noconnect|-u ............ Start unconnected
      --controlflag|-z .......... ETX in non-interactive mode
      --nossl|-s ................ Do not use SSL
      --noredirect|-r ........... Do not follow redirects
      --norc|-n ................. Do not load cmshrc file on start-up
      --command|-c <"c1; c2; ..."> Execute commands and exit
      --file|-f <filename> ...... Execute commands in file and exit
      --echo|-x ................. Echo all commands
      --quit|-q ................. Exit immediately after error

    Figure 3.8: Usage information for cmsh

    3.6.2 Levels, Modes, Help, And Commands Syntax In cmsh
    The top level of cmsh is the level that cmsh is in when entered without any options. To avoid overloading a user with commands, cluster management functionality has been grouped and placed in separate cmsh modes. Modes and their levels are a hierarchy available below the top level.
21. By default, the cluster responds to ICMP ping packets and allows SSH access from the whole world. Depending on site policy, access to port 8081 may also be enabled, to allow access to the cluster management daemon.

    To remove all rules, for example for testing purposes, the clear option should be used. This then allows all network traffic through:

    shorewall clear

    Administrators should be aware that in Red Hat distribution variants the service shorewall stop command corresponds to the shorewall stop command, and not to the shorewall clear command. The stop option blocks network traffic, but allows a pre-defined minimal safe set of connections, and is not the same as completely removing Shorewall from consideration. This differs from Debian-like distributions, where service shorewall stop corresponds to shorewall clear, and removes Shorewall from consideration.

    Full documentation on Shorewall is available at http://www.shorewall.net.

    12.3 Compilers
    Bright Computing provides convenient RPM packages for several compilers that are popular in the HPC community. All of those may be installed through yum, but, with the exception of GCC, require an installed license file to be used.

    12.3.1 GCC
    Package names: gcc-recent

    12.3.2 Intel Fortran and C Compilers
    Package names: intel-fc and intel-cc
    The Intel compiler packages include the Intel Fortran and Intel C compilers. For both compilers
22. CATEGORY   e.g. anycategory; anothercategory,yetanothercategory
    GROUP      e.g. anygroupname; anothergroupname,yetanothergroupname
    CHASSIS    e.g. anychassisname; anotherchassisname,yetanotherchassisname
    NODES      e.g. node001..node015,node20..node028,node030
    PDUPORT    e.g. apc01, or apc01:8,apc01:5
    SECONDS    e.g. 0.2; by default: 1 second

    Figure 5.6: Synopsis Of power Command In device Mode Of cmsh

    5.3 Monitoring Power
    Monitoring power consumption is important, since electrical power is an important component of the total cost of ownership for a cluster. The monitoring system of Bright Cluster Manager collects power-related data from PDUs in the following metrics:

    • PDUBankLoad: Phase load, in amperes, for one specified bank in a PDU
    • PDULoad: Total phase load, in amperes, for one PDU

    Chapter 10 on cluster monitoring has more on metrics and how they can be visualized.

    Node Provisioning

    By default, nodes boot from the network when using Bright Cluster Manager, and a network boot (sometimes called a PXE boot) is recommended as a BIOS setting for nodes. On disked nodes, gPXE software is placed by default on the drive during node installation. If the boot instructions from the BIOS for PXE boot fail, and if the BIOS instructions are that a boot attempt should then be made from the hard drive, it means that a PXE network boot attempt is done
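As a toy illustration of what the ampere readings mean for capacity planning (all values below are invented, and the voltage is an assumption, not something the metrics report): summing the per-bank loads of a PDU gives its total phase load, and multiplying by the line voltage approximates the power draw.

```shell
# Invented sample readings: PDUBankLoad for two banks of one PDU, in amperes.
bank1=4.2
bank2=3.8
voltage=230   # assumed line voltage in volts (not reported by the metrics)

# Total phase load (what PDULoad would correspond to) and approximate draw.
awk -v b1="$bank1" -v b2="$bank2" -v v="$voltage" \
    'BEGIN { total = b1 + b2
             printf "PDULoad: %.1f A, ~%.0f W\n", total, total * v }'
# prints: PDULoad: 8.0 A, ~1840 W
```

This is only arithmetic on sample values; the monitoring system itself reports the amperes, and any wattage estimate depends on the site's actual line voltage.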
23. [Figure 10.23: cmgui Monitoring Health Check Configuration Edit Dialog — settings shown for the cputhreshcheck health check: log length 6000, sampling interval 120, gap size 2, threshold duration 1, fail severity 10, unknown severity 10, Store ticked, and killallyes set as the Fail action and State Flapping action]

    [Figure 10.24: cmgui Monitoring Main Metrics Tab — lists metrics such as CPUCoresAvailable, CPUIdle, CPUIrq, CPUNice, CPUSoftIrq, CPUSystem, CPUUser, CPUWait, CtxtSwitches and DevicesUp, with their source and command (mostly <built-in>)]

    10.4.4 Metrics
    The Metrics tab displays the list of metrics that
24. [root@oss001 ~]# echo "/dev/sdb /mnt/ost01 lustre rw,_netdev 0 0" >> /etc/fstab

    After mounting the OSTs, the Lustre clients can mount the Lustre filesystem.

    12.6.3 Client Implementation
    There are several ways to install a Lustre client. If the client has a supported kernel version, the lustre-client RPM and lustre-client-modules RPM can be installed. The lustre-client-modules package installs the required kernel modules.

    If the client does not have a supported kernel, a Lustre kernel, Lustre modules and Lustre userland software can be installed with RPM packages. The client kernel modules and client software can also be built from source.

    Creating The Lustre Client Image: Method 1
    This method describes how to create a Lustre client image with Lustre client RPM packages. It requires that the lustre-client-modules package have the same kernel version as the kernel version used for the image.

    To create a starting point image for the Lustre client image, a clone is made of the existing software image, for example from default-image. The clone software image is created via cmgui (figure 12.1), or using cmsh on the head node:

    Example:
    [root@mycluster ~]# cmsh
    [mycluster]% softwareimage
    [mycluster->softwareimage]% clone default-image lustre-client-image
    [mycluster->softwareimage*[lustre-client-image*]]% commit

    The RPM Lustre client packages are downloaded from the
Pinging an IPMI interface can be used to verify that the IPMI interface of a head node is reachable from its counterpart.

Example
On mycluster1, verify that the IPMI interface of mycluster2 is reachable:

[root@mycluster1 ~]# ping -c 1 mycluster2.ipmi.cluster
PING mycluster2.ipmi.cluster (10.148.255.253) 56(84) bytes of data.
64 bytes from mycluster2.ipmi.cluster (10.148.255.253): icmp_seq=1 ttl=64 time=0.033 ms

On mycluster2, verify that the IPMI interface of mycluster1 is reachable:

[root@mycluster2 ~]# ping -c 1 mycluster1.ipmi.cluster
PING mycluster1.ipmi.cluster (10.148.255.254) 56(84) bytes of data.
64 bytes from mycluster1.ipmi.cluster (10.148.255.254): icmp_seq=1 ttl=64 time=0.028 ms

While testing an HA setup with automated failover, it can be useful to simulate a kernel crash on one of the head nodes. The following command can be used to crash a head node instantly:

echo c > /proc/sysrq-trigger

After the active head node freezes as a result of the crash, the passive head node will power off the machine that has frozen, and will then proceed to switch to active mode.

13.3 Managing HA
Once an HA setup has been created, there are several things to be aware of while managing the cluster.

13.3.1 cmha utility
The main utility for interacting with the HA subsystem is cmha. Using cmha, an administrator may query the state the HA subsystem is in on the local machine. An administrator may also manually initiate a failov
• Select Start Rescue Environment to boot the node into a Linux ramdisk environment. Once the rescue environment has finished booting, log in as root. No password is required.

• Execute the following command:

/cm/cm-clone-install --failover

• When prompted to enter a network interface to use, enter the interface that was used to boot from the internal cluster network (e.g. eth0, eth1). When unsure about the interface, switch to another console and use ethtool -p <interface> to make the NIC corresponding to an interface blink.

• If the provided network interface is correct, a password prompt for the master's root account will appear. Enter the root password.

• After the cloning process has finished, press Y to reboot, and let the machine boot off its hard drive.

• Once the secondary head node has finished booting from its hard drive, go back to the primary head node and select Finalize.

• Enter the MySQL root password.

• Verify that the mysql, ping and status checks are listed as OK for both head nodes. This confirms that the HA setup was completed successfully. The backupping check will initially report FAILED, but will start working as soon as the secondary head node has been rebooted. Press OK, and then Reboot, to reboot the secondary head node.

• Wait until the secondary head node has fully booted, and select Failover Status from the main menu. After that, select View failover status, and confirm that backupping is also reported as OK.

13.2.3 Shared Storage Setup
Unmount script: /cm/local/apps/cmd/scripts/drbd-unmount.sh
Warn time: 5
[mycluster1->partition[base]->failover]%

Keep alive
The passive head node will use the value specified as Keep alive as a frequency for checking that the active head node is still up. If a dedicated failover network is being used, there will be 3 separate heartbeat checks for determining that a head node is reachable.

Warn time
When a passive head node determines that the active head node is not responding to any of the periodic checks for a period longer than the Warn time, a warning is logged that the active head node might become unreachable soon.

Dead time
When a passive head node determines that the active head node is not responding to any of the periodic checks for a period longer than the Dead time, the active head node is considered dead and a quorum is initialized. Depending on the outcome of the quorum, a failover sequence may be initiated.

Failover network
The Failover network setting determines the network that will be used as a dedicated network for the backupping heartbeat check. This is normally a direct cable from a NIC on one head node to a NIC on the other head node.

Init dead
When both head nodes are booted simultaneously, the standard Dead time might be too strict if one head node requires a bit more time for booting than the other. For this reason, when a node boots
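The interplay of the timing parameters above amounts to a simple decision rule on the passive head node. The following Python sketch is illustrative only: the dead_time value is an assumed example, not a Bright default, and this is not CMDaemon's actual implementation.

```python
def ha_check(no_response_seconds, warn_time=5, dead_time=10):
    """Illustrative decision rule for the passive head node's view of the
    active head node. dead_time here is an assumed example value."""
    if no_response_seconds > dead_time:
        # Active head node considered dead: a quorum is initialized, and
        # depending on its outcome a failover sequence may be started.
        return "dead"
    if no_response_seconds > warn_time:
        # A warning is logged: the active head node may soon be unreachable.
        return "warn"
    return "ok"

print(ha_check(3), ha_check(7), ha_check(30))  # ok warn dead
```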
Figure 10.15: cmgui Monitoring Configuration Tabs

• Health Checks
• Actions

The tabs are now discussed in detail.

10.4.1 The Overview Tab
The Overview tab of Figure 10.15 shows an overview of custom threshold actions and custom health check actions that are active in the system. Each row of conditions in the list that decides if an action is launched is called a rule. Only one rule is on display in Figure 10.15, showing an overview of the metric threshold action settings which were set up in the basic example of section 10.1.

The Add rule button runs a convenient wizard that guides an administrator in setting up a condition, and thereby avoids having to go through the other tabs separately.

The Remove button removes a selected rule.

The Edit button edits aspects of a selected rule. It opens a dialog that edits a metric threshold configuration or a health check configuration. These configuration dialog options are also accessible from within the Metric Configuration and Health Check Configuration tabs.

The Revert button reverts a modified state of the tab to the last saved state.

The Save button saves a modified state of the tab.

10.4.2 The Metric Configuration Tab
The Metric Configuration tab allows device categories to be selected for the sampling of metrics. Properties of metrics related to the taking of samples can then be configured from this tab.
be booted from the default image copy on the head node via a network boot again. Typically, this is done by manual intervention during node boot, to select network booting from the BIOS of the node. As suggested by the Bright Cluster Manager gPXE boot prompt, setting network booting to work from the BIOS (regular PXE booting) is preferred to gPXE booting from the disk.

6.3.11 Running Finalize Scripts
A finalize script is similar to an initialize script (section 6.3.5), only it runs a few stages later in the node-provisioning process.

A finalize script is used when custom commands need to be executed after the preceding mounting, provisioning, and housekeeping steps, but before handing over control to the node's local init process. For example, custom commands may be needed to initialize some unsupported hardware, or to supply a configuration file that cannot be added to the provisioned image because it needs node-specific settings. Such custom commands are then added to the finalize script.

A finalize script can be added to both a node's category and the node configuration. The node-installer first runs a finalize script, if it exists, from the node's category, and then a finalize script, if it exists, from the node's configuration.

The node-installer sets several environment variables which can be used by the finalize script. Appendix E contains an example script which documents these variables.

Similar to the finalize script
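As an illustration, a minimal finalize script might look like the sketch below. The environment variable names CMD_FSROOT and CMD_HOSTNAME are placeholders invented for this sketch; the actual variables set by the node-installer are the ones documented in Appendix E.

```shell
#!/bin/bash
# Sketch of a finalize script: write a node-specific configuration file
# into the provisioned filesystem before control passes to local init.
# CMD_FSROOT and CMD_HOSTNAME are hypothetical variable names; see
# Appendix E for the real environment variables.
FSROOT="${CMD_FSROOT:-$(mktemp -d)}"   # root of the provisioned filesystem
NODENAME="${CMD_HOSTNAME:-node001}"    # hostname assigned to this node

mkdir -p "${FSROOT}/etc"
cat > "${FSROOT}/etc/nodename.conf" <<EOF
# Generated at provisioning time for ${NODENAME}
hostname=${NODENAME}
EOF
echo "wrote ${FSROOT}/etc/nodename.conf"
```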
/cm/local/apps/cmd/scripts/powerscripts/ilo_power p1
[mycluster]->device% foreach -c slave (set powercontrol custom)
[mycluster]->device*% commit

Figure 5.3: Head Node Tasks

5.2 Power Operations
Power operations may be done on devices from either cmgui or cmsh. There are four main power operations:

• Power On: power on a device
• Power Off: power off a device
• Power Reset: power off a device and power it on again after a brief delay
• Power Status: check the power status of a device

5.2.1 Power Operations With cmgui
In cmgui, buttons for executing On/Off/Reset operations are located under the Tasks tab of a device. Figure 5.3 shows the Tasks tab for a head node. The Overview tab of a device can be used to
conf (see Appendix C), where <workload manager> takes the value of SGE, Torque, or PBS, as appropriate. A very short guide to some specific workload manager commands that can be used outside of the Bright Cluster Manager 5.1 system is given in Appendix G.

8.4.1 SGE Installation, Initialization And Configuration
Installing SGE
The SGE package comes with a Bright Cluster Manager 5.1 installation, even if another, or no, workload manager was chosen for configuration and setup. To set it up for use for the very first time, the workload manager server role is initialized, typically on the head node, using the cm-install-qmaster script with the -q option:

/cm/shared/apps/sge/current/cm/cm-install-qmaster -q

The -h option displays a help text listing the other options for this script. One of these options is -c <image>, where <image> is the path to the node image. The -c option can be used to place the execution daemon in the node image.

Example

/cm/shared/apps/sge/current/cm/cm-install-qmaster -c /cm/images/default-image

If there are provisioning nodes, the updateprovisioners command (section 6.1.4) should be run. The nodes can then simply be rebooted to pick up the new image, or, alternatively, to avoid rebooting, the imageupdate command (section 6.5.2) can be run to pick up the new image from a provisioning node.
<?xml version="1.0" encoding="ISO-8859-1"?>
<diskSetup xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
           xsi:noNamespaceSchemaLocation="schema.xsd">
  <device>
    <blockdev>/dev/sda</blockdev>
    <blockdev>/dev/hda</blockdev>
    <partition id="a1">
      <size>5G</size>
      <type>linux</type>
      <filesystem>ext3</filesystem>
      <mountPoint>/</mountPoint>
      <mountOptions>defaults,noatime,nodiratime</mountOptions>
    </partition>
    <partition id="a2">
      <size>2G</size>
      <type>linux</type>
      <filesystem>ext3</filesystem>
      <mountPoint>/var</mountPoint>
      <mountOptions>defaults,noatime,nodiratime</mountOptions>
    </partition>
    <partition id="a3">
      <size>2G</size>
      <type>linux</type>
      <filesystem>ext3</filesystem>
      <mountPoint>/tmp</mountPoint>
      <mountOptions>defaults,noatime,nodiratime,nosuid,nodev</mountOptions>
    </partition>
    <partition id="a4">
      <size>auto</size>
      <type>linux swap</type>
    </partition>
    <partition id="a5">
      <size>max</size>
      <type>linux</type>
      <filesystem>ext3</filesystem>
      <mountPoint>/local</mountPoint>
      <mountOptions>defaults,noatime,nodiratime</mountOptions>
    </partition>
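Since the layout is plain XML, ad-hoc sanity checks on a disk setup file are easy to script. The fragment below is an illustrative Python check, not a Bright tool: it parses a minimal diskSetup fragment of the same shape as the example above and verifies one rule the schema description notes, namely that swap partitions carry no filesystem element.

```python
import xml.etree.ElementTree as ET

# Minimal diskSetup fragment in the same shape as the example above
# (devices and sizes are illustrative).
XML = """<diskSetup>
  <device>
    <blockdev>/dev/sda</blockdev>
    <partition id="a1">
      <size>5G</size><type>linux</type>
      <filesystem>ext3</filesystem><mountPoint>/</mountPoint>
    </partition>
    <partition id="a4">
      <size>auto</size><type>linux swap</type>
    </partition>
  </device>
</diskSetup>"""

root = ET.fromstring(XML)
partitions = root.findall("./device/partition")
print(len(partitions))  # 2
for p in partitions:
    if p.findtext("type") == "linux swap":
        # Rule from the schema notes: swap partitions have no filesystem
        assert p.find("filesystem") is None
```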
<!--
Copyright (c) 2004-2010 Bright Computing, Inc. All Rights Reserved.

This software is the confidential and proprietary information of
Bright Computing, Inc. ("Confidential Information"). You shall not
disclose such Confidential Information and shall use it only in
accordance with the terms of the license agreement you entered into
with Bright Computing, Inc.

This is the XML schema description of the partition layout XML file.
It can be used by software to validate partitioning XML files.

There are, however, a few things the schema does not check:

- There should be exactly one root mountpoint, unless diskless.
- There can only be one partition with a 'max' size on a particular
  device. Something similar applies to logical volumes.
- The 'auto' size can only be used for a swap partition.
- Partitions of type 'linux swap' should not have a filesystem.
- Partitions of type 'linux raid' should not have a filesystem.
- Partitions of type 'linux lvm' should not have a filesystem.
- If a raid is a member of another raid, then it can not have a
  filesystem.
- Partitions which are listed as raid members should be of type
  'linux raid'.
- If diskless is not set, there should be at least one device.
-->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">

<xs:element name="diskSetup"
has details on how the directives ProvisioningNodeAutoUpdateTimer and ProvisioningNodeAutoUpdate in cmd.conf control aspects of how updateprovisioners functions.

Example

[mycluster]->softwareimage% updateprovisioners
Provisioning nodes will be updated in the background.
Sun Dec 12 13:45:09 2010 myheadnode: Starting update of software image(s) on provisioning node(s). (user initiated)
[mycluster]->softwareimage% updateprovisioners
[mycluster]->softwareimage%
Sun Dec 12 13:45:41 2010 myheadnode: Updating image default-image on provisioning node node001.
Sun Dec 12 13:46:00 2010 myheadnode: Updating image default-image on provisioning node node001 completed.
Sun Dec 12 13:46:00 2010 myheadnode: Provisioning node node001 was updated.
Sun Dec 12 13:46:00 2010 myheadnode: Finished updating software image(s) on provisioning node(s).

6.2 Software Images
A software image is a complete Linux filesystem that is to be installed on a non-head node. Chapter 9 describes images and their management in greater detail.

The head node holds the head copy of the software images. Whenever files in the head copy are changed using CMDaemon, the changes automatically propagate to all provisioning nodes via the updateprovisioners command (section 6.1.4).

6.2.1 Selecting Kernel Driver Modules To Load Onto Nodes
Each software image contains a Linux kernel and a ramdisk. The ramdisk is loa
Figure 2.13: Network Interface Configuration (type 1)

Figure 2.14: Network Interface Configuration (type 3)

Select Subnet Managers
The Subnet Managers screen (Figure 2.15) is only displayed if an InfiniBand network was defined, and lists all the nodes that can run the InfiniBand subnet manager. The nodes assigned the role of a subnet manager are ticked, and the Continue button is clicked to go on to the CD/DVD-ROMs selection screen, described next.

Figure 2.15: Subnet Manager Nodes

Select CD/DVD-ROM
The CD/DVD-ROMs screen (Figure 2.16) lists all detected CD/DVD-ROM devices. If multiple drives are found, then the drive with the Bright Cluster Manager DVD needs to be selected by the administrator. Clicking on Continue then brings up the Workload Management setup screen, described next.

Figure 2.16: CD/DVD-ROMs
For this reason, it is generally not recommended for end users to log in to the secondary head node.

Although Bright Cluster Manager gives the administrator full flexibility on how shared storage is implemented between two head nodes, there are generally three types being used: NAS, DAS, and DRBD.

NAS
In a Network Attached Storage (NAS) setup, both head nodes mount a shared volume from an external network attached storage device. In the most common situation this would be an NFS server, either inside or outside of the cluster.

Because imported mounts can typically not be re-exported (which is true at least for NFS), nodes typically mount filesystems directly from the NAS device.

DAS
In a Direct Attached Storage (DAS) setup, both head nodes share access to a block device that is usually accessed through a SCSI interface. This could be a disk array that is connected to both head nodes, or it could be a block device that is exported by a corporate SAN infrastructure.

Although the block device is visible, and can be accessed simultaneously on both head nodes, the filesystem that is used on the block device is typically not suited for simultaneous access. In fact, simultaneous access to a filesystem from two head nodes must be avoided at all costs, because it will almost certainly lead to filesystem corruption. Only special-purpose parallel filesystems, such as GPFS and Lustre, are capable of being accessed by two head nodes simultaneously.

DRBD
Figure 6.23: Node Creation Wizard: Setting Interfaces

The default setting for IP offset is 0.0.0.0, and means the default IP address is suggested for assignment to each node in the range. The default IP address is based on the node name, with node001 having the value 10.141.0.1, and so on. An offset of x implies that the xth IP address after the default is suggested for assignment to each node in the range.

Some care must be taken when setting IP addresses using the wizard, since no duplicate IP address checking is done.

Example
A node001 has its default IP address 10.141.0.1. The node005 is then added.

• If IP offset = 0.0.0.0, then 10.141.0.5 is suggested for assignment to node005, because, by default, the node name is parsed and its default IP address suggested.

• If IP offset = 0.0.0.2, then 10.141.0.7 is suggested for assignment to node005, because it is 2 IP addresses after the default.

6.7 Troubleshooting The Node Boot Process
During the node boot process there are several common issues that can lead to an unsuccessful boot. This section describes some of these issues and their solutions. It also provides general hints on how to analyze boot problems.

6.7.1 Node Fails To PXE Boot
Possible reasons to consider if a node
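The wizard's suggestion rule can be reproduced with a few lines of Python. This is an illustrative re-implementation of the arithmetic described above, not part of the Bright tooling.

```python
import ipaddress

def suggested_ip(node_name, base="10.141.0.0", offset="0.0.0.0"):
    """Reproduce the wizard's rule: the default IP is the base address
    plus the node number, shifted by the IP offset (treated as a
    32-bit integer)."""
    number = int(node_name.lstrip("node"))  # e.g. "node005" -> 5
    base_int = int(ipaddress.IPv4Address(base))
    off_int = int(ipaddress.IPv4Address(offset))
    return str(ipaddress.IPv4Address(base_int + number + off_int))

print(suggested_ip("node005"))                     # 10.141.0.5
print(suggested_ip("node005", offset="0.0.0.2"))   # 10.141.0.7
```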
[bright51->device]% showport 00:30:48:30:73:92
switch01:12

When running showport, CMDaemon on the head node queries all switches until a match is found.

If the switch is known, as well as the MAC address, then the switch can also be specified with the -s option. If this is done, the query is carried out on that switch only. Continuing the earlier example:

[bright51->device]% showport -s switch01 00:30:48:30:73:92
switch01:12

Mapping All Port Connections In The Cluster With showport
A list indicating the port connections and switches for all connected devices that are up can be generated using this script:

Example

#!/bin/bash
for nodename in $(cmsh -c "device; foreach * (get hostname)")
do
  macad=$(cmsh -c "device use $nodename; get mac")
  echo -n "$macad $nodename "
  cmsh -c "device showport $macad"
done

The script may take a while to finish its run. It gives an output like:

Example

00:00:00:00:00:00 switch01: No ethernet switch found connected to this mac address
00:30:48:30:73:92 bright51: switch01:12
00:26:6C:F2:AD:54 node001: switch01:1
00:00:00:00:00:00 node002: No ethernet switch found connected to this mac address

Power Management
Being able to control power inside a cluster through software is important for remote cluster administration, and creates opportunities for power savings. It can also be useful
Figure 6.1: cmgui: Adding A provisioning Category

Clicking on the provisioning category in the resource tree on the left-hand side, or, alternatively, double-clicking on the provisioning category in the Overview tabbed pane of the Node Categories right-hand side pane, then opens up the provisioning category (Figure 6.2).

Figure 6.2: cmgui: Configuring A provisioning Role

Selecting the Roles tab in this category displays roles that are part of the provisioning category. Ticking the checkbox of a role assigns the role to the category, and displays the settings that can be configured for this role. The Provisioning slots setting (maxProvisioningNodes in cmsh) decides how many images can be supplied simultaneously from the provisioning node, while the Software images settings related to the images and allimage
Figure 3.4: Cluster Overview

3.5 Navigating the Cluster Management GUI
Aspects of the cluster can be managed by administrators using cmgui (Figure 3.4).

The resource tree on the left side of the screen consists of hardware resources, such as nodes and switches, as well as non-hardware resources, such as Users & Groups and Workload Management. Selecting a resource opens an associated tabbed pane on the right that allows it to be managed.

The number of tabs displayed and their contents depend on the resource selected. The following standard tabs are available for most resources:

• Overview: provides an overview containing the most important status details for the resource.

• Tasks: a
Cluster Manager arranging a rescheduling of the job.

A node that has been put in a Drained state with a health check is not automatically undrained. The administrator must clear such a state manually.

Configuration Using cmgui
To configure the monitoring of nodes as a prejob health check in cmgui, the Monitoring Configuration resource item is selected, and the Health Check Configuration tabbed pane is opened. The default resource is chosen as a value for Health Check Configuration, and the Add button is clicked on to add the health check via a dialog (Figure 8.15). In the dialog, the Health Check script value is set to the chosen health check, and the Sampling interval is set to prejob, which automatically sets the Fail action to Drain node.

After saving these settings, any node that is not in the Drained state in the default resource gets a pre-job check when a job is scheduled for the node, and the pre-job check puts the node in a Drained state if it is unhealthy.

Figure 8.15: cmgui: New Health Check Configuration Dialog
6.3.1 Requesting A Node Certificate
Each node communicates with the CMDaemon on the head node using a certificate. If no certificate is found, it automatically requests one from the CMDaemon running on the head node (Figure 6.5).

Figure 6.5: Certificate Request

Certificate auto-signing
By default, certificate auto-signing means the cluster management daemon automatically issues a certificate to any node that requests a certificate.

For untrusted networks, it may be wiser to approve certificate requests manually, to prevent new nodes being added automatically without getting noticed. Disabling certificate auto-signing can then be done by issuing the autosign off command from cert mode in cmsh.

Section 3.3 has more information on certificate management in general.

Example

Disabling certificate auto-sign mode:

[mycluster]% cert autosign on
on
[mycluster]% cert autosign off
off
[mycluster]% cert autosign
off
[mycluster]%

Certificate storage and removal implications
After receiving a valid certificate, the node-installer stores it in /cm/node-installer/certificates/<node mac address>/ on the head node. This directory is NFS-exported to the nodes, but can only be accessed by the root user. The node-installer does not request a new certificate if it finds a certificate in this directory, valid or invalid.

If an invalid certificate is received, the screen displays a
Figure 5.4: Head Node Overview

devices in the operating sequence, to avoid power surges on the infrastructure. The delay period may be altered using cmsh's -d (delay) flag.

The Overview tab of a PDU object (Figure 5.5) allows power operations on PDU ports by the administrator directly. All ports on a particular PDU can have their power state changed, or a specific PDU port can have its state changed.

5.2.2 Power Operations Through cmsh
All power operations in cmsh are done using the power command in device mode. Some examples of usage are now given:

• Powering on node001 and node018:

Example

[mycluster]->device% power -n node001,node018 on
apc01:1 ............ [   ON   ] node001
apc02:8 ............ [   ON   ] node018

• Powering off all nodes in the slave category with a 100ms delay between nodes (some output elided):

Example

[mycluster]->device% power -c slave -d 0.1 off
apc01:1 ............ [  OFF  ] node001
apc01:2 ............ [  OFF  ] node002
...
... ................ [  OFF  ] node953
H.3 Actions And Their Parameters
H.3.1 Actions

Table H.3.1: List Of Actions

Name — Description
Drain node — Allows no new processes on a compute node from the workload manager. Usage tip: plan for undrain from another node becoming active.
killprocess* — Kills a process with the KILL (-9) signal.
Power off — Powers off, hard.
Power on — Powers on, hard.
Power reset — Power reset, hard.
Reboot — Reboots via the system, trying to shut everything down cleanly, and then starts up again.
SendEmail — Sends mail, using the mailserver that was set up during server configuration. Format: sendemail somebody@example.com. Default destination is root@localhost.
Shutdown — Powers off via the system, trying to shut everything down cleanly.
test action* — An action script example for users who would like to create their own scripts. The source has helpful remarks about the environment variables that can be used, as well as tips on configuring it generally.
Undrain node — Allows processes to run on the node from the workload manager.

* standalone scripts, not built-ins; located in directory /cm/local/apps/cmd/scripts/actions

H.3.2 Parameters For Actions
Actions have the parameters indicated by the left column in the example below.

Example

[myheadnode->monitoring->actions]% show drainnode
Parameter
---------
Command
Description
Name
Run on
Timeout
isCustom
Information overview screen, described next.

Figure 2.4: Kernel Modules Recommended For Loading After Probing

Hardware Overview
The Hardware Information screen (Figure 2.5) provides an overview of detected hardware, depending on the kernel modules that have been loaded. If any hardware is not detected at this stage, the Go Back button is used to go back to the Kernel Modules screen (Figure 2.4) to add the appropriate modules, and then the Hardware Information screen is returned to, to see if the hardware has been detected. Clicking Continue in this screen leads to the Nodes configuration screen, described next.

Figure 2.5: Hardware Overview
computations. In addition to compute nodes, larger clusters may have other types of nodes as well, e.g. storage nodes and login nodes.

Nodes can be easily installed through the network-bootable node provisioning system that is included with Bright Cluster Manager. Every time a compute node is started, the software installed on its local hard drive is synchronized automatically against a software image which resides on the head node. This ensures that a node can always be brought back to a known state. The node provisioning system greatly eases compute node administration and makes it trivial to replace an entire node in the event of hardware failure. Software changes need to be carried out only once, in the software image, and can easily be undone. In general, there will rarely be a need to log on to a compute node directly.

In most cases, a cluster has a private internal network, which is usually built from one or multiple managed Gigabit Ethernet switches. The internal network connects all nodes to the head node and to each other. Compute nodes use the internal network for booting, data storage and interprocess communication. In more advanced cluster setups, there may be several dedicated networks. Note that the external network — which could be a university campus network, company network or the Internet — is not normally directly connected to the internal network. Instead, only the head node is connected to the external network.
#bytes   #repetitions   t[usec]   Mbytes/sec
0        1000           0.78      0.00
1        1000           1.08      0.88
2        1000           1.07      1.78
4        1000           1.08      3.53
8        1000           1.08      7.06
16       1000           1.16      13.16
32       1000           1.17      26.15
64       1000           1.17      52.12
128      1000           1.20      101.39
256      1000           1.37      177.62
512      1000           1.69      288.67
1024     1000           2.30      425.34
2048     1000           3.46      564.73
4096     1000           7.37      530.30
8192     1000           11.21     697.20
16384    1000           21.63     722.24
32768    1000           42.19     740.72
65536    640            70.09     891.69
131072   320            125.46    996.35
262144   160            238.04    1050.25
524288   80             500.76    998.48
1048576  40             1065.28   938.72
2097152  20             2033.13   983.71
4194304  10             3887.00   1029.07

All processes entering MPI_Finalize

To run on different nodes than node001 and node002, the nodes file must be modified to contain different hostnames. To perform a more extensive run, the PingPong argument should be omitted.

4.5 Configuring Switches and PDUs
4.5.1 Configuring With The Manufacturer's Configuration Interface
Network switches and PDUs that will be used as part of the cluster should be configured with the PDU/switch configuration interface described in the PDU/switch documentation supplied by the manufacturer. Typically, the interface is accessed by connecting via a web browser or telnet to an IP address preset by the manufacturer.

The IP settings of the PDU/switch must be configured to match the settings of the device in the cluster management software.

4.5.2 Configuring SNMP
Moreover, in order to allow the cluster
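The bandwidth column of the PingPong output follows directly from the message size and the measured time. A quick consistency check, assuming (as the numbers suggest) that Mbytes here means 2^20 bytes:

```python
def mbytes_per_sec(nbytes, t_usec):
    # Bandwidth = message bytes / time, in units of 2**20 bytes per second
    return nbytes / (t_usec * 1e-6) / 2**20

# A few rows from the table above: (bytes, t[usec], reported Mbytes/sec)
for nbytes, t_usec, reported in [(1024, 2.30, 425.34),
                                 (65536, 70.09, 891.69),
                                 (4194304, 3887.00, 1029.07)]:
    assert abs(mbytes_per_sec(nbytes, t_usec) - reported) < 1.0
print("bandwidth column is consistent")
```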
Monitoring Thresholds Display

• State Flapping
The first selection box decides what action to launch if state flapping is detected. The next box is a plain text entry box that allows a parameter to be passed to the action. The third box is a selection box again, which decides when to launch the action, depending on which of the following states is set:

– Enter: if the flapping has just started. That is, the current sample is in a flapping state, and the previous sample was not in a flapping state.

– During: if the flapping is ongoing. That is, the current and previous flapping samples are both in a flapping state.

– Leave: if the flapping has just stopped. That is, the current sample is not in a flapping state, and the previous sample was in a flapping state.

Metric Configuration: Thresholds Options
The Metric Configuration tab of Figure 10.16 also has a Thresholds button associated with a selected metric.

Thresholds are defined, and their underlying concepts are discussed, in section 10.2.3. The current section describes the configuration of thresholds.

In the basic example of section 10.1, CPUUser was configured so that if it crossed a threshold of 50, it would run an action (the killallyes script). The threshold configuration was done using the Thresholds button of cmgui.

Clicking on the Thresholds button launches the Thresholds display window, which lists the thresholds set for that metric (Figure 10.18, which corresponds to
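The Enter/During/Leave semantics amount to comparing the states of the current and previous samples. The sketch below is illustrative only, not CMDaemon's implementation:

```python
def crossing_state(prev_flapping, now_flapping):
    """Classify a condition the way the dialog's Enter/During/Leave
    choices do, from two consecutive samples."""
    if now_flapping and not prev_flapping:
        return "Enter"   # flapping has just started
    if now_flapping and prev_flapping:
        return "During"  # flapping is ongoing
    if prev_flapping and not now_flapping:
        return "Leave"   # flapping has just stopped
    return None          # no flapping in either sample

print(crossing_state(False, True),
      crossing_state(True, True),
      crossing_state(True, False))  # Enter During Leave
```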
49. Name Hour Length 2000 Interval 3600 Time Offset 0 Kind Average Figure 10 21 cmgui Metric Configuration Consolidators Edit Dialog The Edit and Remove buttons in this display edit and remove a se lected consolidator from the list of consolidators while the Add button in this display adds a new consolidator to the list of consolidators The Edit and Add dialogs for a consolidator prompt for the following values Figure 10 21 e Name the consolidator s name By default Day Hour Month are al ready set up with appropriate values for their corresponding fields Length the number of intervals that are logged for this consolida tor Not to be confused with the metric log length Interval the time period in seconds associated with the consol idator Not to be confused with the metric interval time period For example the default consolidator with the name Hour has a value of 3600 Time Offset The time offset from the default consolidation time To understand what this means consider the Log length of the metric which is the maximum number of raw data points that the metric stores When this maximum is reached the oldest data point is removed from the metric data when a new data point is added Each removed data point is gathered and used for data consolida tion purposes For a metric that adds a new data point every Sampling interval seconds the time traw gone Which is how many seconds int
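The relationship between these fields can be checked with a little shell arithmetic. A sketch using the default Hour consolidator values shown above (Length 2000, Interval 3600), together with an assumed metric sampling interval of 120 s and a raw log length of 3000 samples:

```shell
#!/bin/sh
# How far back do raw samples reach before consolidation takes over?
SAMPLING_INTERVAL=120   # seconds between raw samples (assumed example)
LOG_LENGTH=3000         # maximum number of raw data points kept

# Raw data window, in seconds and in whole days:
raw_window=$((SAMPLING_INTERVAL * LOG_LENGTH))
echo "raw window: $raw_window s ($((raw_window / 86400)) days)"

# The Hour consolidator keeps CONS_LENGTH averaged intervals of
# CONS_INTERVAL seconds each:
CONS_LENGTH=2000
CONS_INTERVAL=3600
cons_window=$((CONS_LENGTH * CONS_INTERVAL))
echo "consolidated window: $cons_window s ($((cons_window / 86400)) days)"
```

With these example values the raw data covers about 4 days, while the Hour consolidator extends the view to roughly 83 days at reduced resolution.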
50. Nodes gt Chassis Slave Nodes Software image o gt E een at gt E GPU Units mics gt E Other Devices Reinstall nodes gt Node Groups AB Users amp Groups Workload Management Workload IA Monitoring Configuration Authorization B Authentication Set category X Set Figure 6 18 cmgui Provisioning Log Button For A Device Resource 6 3 8 Writing Network Configuration Files In the previous section the local drive of the node is synchronized ac cording to install mode settings with the software image from the provi sioning node The node installer now sets up configuration files for each configured network interface These are files like etc sysconfig network scripts ifcfg eth0O for Red Hat Scientific Linux and Centos while SuSE would use etc sysconfig network ifcfg eth0d These files are placed on the local drive When the node installer finishes its remaining tasks sections 6 3 9 6 3 13 it brings down all network devices and hands over control to the local sbin init process Eventually a local init script uses the network configuration files to bring the interfaces back up 6 3 9 Creating A Local etc fstab File The etc fstab file on the local drive contains local partitions on which filesystems are mounted as the init process runs The actual drive layout is configured in the category configuration or the node configuration so the node installer is able to generate and place a valid lo
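As an illustration of the kind of file the node-installer writes, the following sketch generates a minimal Red Hat-style ifcfg-eth0 file. The interface name, addresses, and target directory are examples only; on a real node the values come from the interface configuration held by the cluster management daemon, and the file is placed under /etc/sysconfig/network-scripts/:

```shell
#!/bin/sh
# Write a minimal ifcfg-eth0 into a scratch directory; a real
# node-installer would place it in /etc/sysconfig/network-scripts/.
DEST=/tmp/ifcfg-demo
mkdir -p "$DEST"

cat > "$DEST/ifcfg-eth0" <<'EOF'
DEVICE=eth0
ONBOOT=yes
BOOTPROTO=static
IPADDR=10.141.0.1
NETMASK=255.255.0.0
EOF

# The local init process later brings the interface up from this file.
grep '^IPADDR=' "$DEST/ifcfg-eth0"
```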
51. Number of valid frames received but dis carded by the forwarding process Number of frames discarded due to an exces sive size Total number of good packets received and directed to a multicast address Total number of well received packets longer than 1518 octets Total number of packets received which are less than 64 octets long Switch uptime Cluster wide core usage in idle tasks Cluster wide core usage in system mode Cluster wide core usage in user mode Cluster wide total memory used Total number of nodes Cluster wide total swap used System uptime Total used space by a mount point Total number of milliseconds spent by all writes Total number of writes completed success fully The average time in milliseconds for I O re quests issued to device sda to be served sample_gpu sample_ilo Number of input datagrams to be forwarded The number of IP datagram fragments gener ated Number of IP datagrams which needed to be fragmented but could not Number of IP datagrams successfully frag mented Number of input datagrams discarded be cause the IP address in their header was not a valid address continues Bright Computing Inc H 1 Metrics And Their Parameters 293 Table H 1 1 List Of Metrics continued Name Description ipInDelivers Total number of input datagrams success fully delivered ipInDiscards Number of input IP datagrams discarded ipInHdrErrors Number of input
...Packages  157
9.3 Managing Packages Inside Images  158
9.4 Kernel Updates  159
9.5 Creating Custom Software Images  159

10 Cluster Monitoring  163
10.1 A Basic Example Of How Monitoring Works  163
10.2 Monitoring Concepts And Definitions  167
10.3 Monitoring Visualization With Cmgui  171
10.4 Monitoring Configuration With Cmgui  176
10.5 Overview Of Monitoring Data  190
10.6 Event Viewer  191
10.7 Monitoring Modes With Cmsh  192

11 Day-to-day Administration  207
11.1 Parallel Shell  207
11.2 Disallowing User Logins To Nodes  208
11.3 Getting Help With Bugs And Other Issues  208
11.4 Backups  210
11.5 BIOS Configuration And Updates  211
11.6 Hardware Match Check  213

12 Third Party Software  215
12.1 Modules Environment  215
12.2 Shorewall  215
12.3 Compilers  216
12.4 Intel Cluster Checker
12.5 CUDA  223

13 High Availability
13.1 HA Concepts
Figure 8.15: Configuring A Prejob Healthcheck Via cmgui (the failedprejob health check for the slave category, with Drain node as the fail action and Undrain node as the pass action)

9 Software Image Management

Since Bright Cluster Manager is built on top of an existing Linux distribution, the administrator must use distribution-specific utilities for software package management. For Bright Cluster Manager-related packages, a separate package management infrastructure has been set up, which is described in this chapter.

9.1 Bright Cluster Manager RPM Packages

Bright Cluster Manager relies on the RPM Package Manager (rpm) to manage its software packages. An example of such an RPM package is:

mpich-ge-gcc-64-1.2.7-40_cm5.1.x86_64.rpm

The file name has the following structure:
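The components of such a package file name can be teased apart with shell parameter expansion. A sketch using the example file name above, assuming the usual name-version-release.architecture.rpm convention (for a real file, rpm -qp with a query format would read the actual metadata instead):

```shell
#!/bin/sh
# Split an RPM file name into its components, assuming the usual
# name-version-release.architecture.rpm convention.
f=mpich-ge-gcc-64-1.2.7-40_cm5.1.x86_64.rpm

base=${f%.rpm}          # strip the .rpm suffix
arch=${base##*.}        # architecture: last dot-separated field
rest=${base%.*}         # what remains is name-version-release
release=${rest##*-}     # release: after the last dash
namever=${rest%-*}
version=${namever##*-}  # version: after the (new) last dash
name=${namever%-*}      # everything before the version is the name

echo "name=$name version=$version release=$release arch=$arch"
```

Note that dashes inside the package name itself (here mpich-ge-gcc-64) are handled correctly because version and release are taken from the right-hand end.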
54. Support host page locked memory mapping Yes Compute mode Default Concurrent kernel execution No Device has ECC support enabled No Device is using TCC driver mode No deviceQuery CUDA Driver CUDART CUDA Driver Version 3 20 CUDA Runtime Version 3 20 NumDevs 2 Device Tesla T10 Processor Device Tesla T10 Processor PASSED The CUDA user manual has further information on how to run com pute jobs using CUDA 12 5 3 Verifying OpenCL CUDA 3 2 also contains an OpenCL compatible interface To verify that the OpenCL is working the oclDeviceQuery utility can be built and exe cuted Example root cuda test module add cuda32 toolkit root cuda test cd CUDA_SDK OpenCL root cuda test OpenCL make clean root cuda test OpenCL make root cuda test OpenCL bin linux release oclDeviceQuery oclDeviceQuery exe Starting OpenCL SW Info CL_PLATFORM_NAME NVIDIA CUDA CL_PLATFORM_VERSION OpenCL 1 0 CUDA 3 2 1 OpenCL SDK Revision 7027912 OpenCL Device Info 2 devices found supporting OpenCL CL_DEVICE_NAME Tesla T10 Processor CL_DEVICE_VENDOR NVIDIA Corporation CL_DRIVER_VERSION 260 19 21 CL_DEVICE_VERSION OpenCL 1 0 CUDA CL_DEVICE_TYPE CL_DEVICE_TYPE_GPU CL_DEVICE_MAX_COMPUTE_UNITS 30 Bright Computing Inc 226 Third Party Software CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS CL_DEVICE_MAX_WORK_ITEM_SIZES CL_DEVICE_MAX_WORK_GROUP_SIZE CL_DEVICE_MAX_CLOCK_FREQUENCY CL_DEVICE_
55. Switches Cluster name My Cluster v Networks externainet Timezone America Los_Angeles A Users amp Groups Workload Manage bA Monitoring Configu Authorization Slave name node Slave digits 3 v B Authentication Racks 1 X Height 42u X Bottom of rack is position 1 Name server 192 168 101 1 Et Time server pool ntp org m Ca Search domain clustervision com Figure 4 6 Cluster Settings 4 2 3 Configuring External Network Parameters After both internal and external networks are defined it may be necessary to change network parameters from their original installation settings Changing Head Node Hostname Normally the name of a cluster is used as the hostname of the head node To reach the head node from inside the cluster the alias master may be used at all times Setting the hostname of the head node itself to master is not recommended To change the hostname of the head node the device object corre sponding to the head node must be modified In cmgui the device listed under Head Nodes in the resource tree is selected and its Settings tab selected from the tabbed pane Figure 4 7 The hostname is changed by modifying the Hostname property and clicking on Save When setting a hostname a domain is not included The hostname of the head node can also be changed via cmsh Example root mycluster cmsh mycluster device use master mycluster gt device mycluster
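Since the hostname is set without a domain, a fully qualified name supplied by mistake can first be split into its host and domain parts. A small sketch, with an example value only:

```shell
#!/bin/sh
# Separate host and domain parts of a fully qualified domain name;
# only the host part would be used as the head node hostname.
fqdn=mycluster.cm.cluster   # example value only

host=${fqdn%%.*}    # part before the first dot
domain=${fqdn#*.}   # everything after the first dot

echo "hostname=$host domain=$domain"
```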
56. Value lt built in gt Remove a node from further use by the scheduler Drain node master 5 no The meanings of these parameters are Command For a standalone metric script it is the full path For a built in the value cannot be set and the command is simply the name of the metric Bright Computing Inc 300 Metrics Health Checks And Actions Description Description of the metric Empty by default Name The name given to the metric Run on The node it will run on For standalone actions it is usually a choice of head node or the non head node For non head nodes the action will run from the node that triggered it if the node has sufficient permission to do that Timeout After how many seconds the command will give up retrying Default value is 5 seconds isCustom Is this a standalone script Bright Computing Inc Metric Collections This appendix gives details on metric collections In section 10 4 4 metric collections are introduced and how to add a metric collections script with cmgui is described This appendix covers how to add a metric collections script with cmsh It also describes the output specification of a metric collections script along with example outputs so that a metric collections script can be made by the administrator 1 1 Metric Collections Added Using Cmsh A metric collections script responsiveness is added in the monitoring metrics mode just like any othe
57. a reseller or system integrator then the first line of support is provided by the reseller or system integrator The reseller or system integrator in turn contacts the Bright Computing support department if 2 or 3 4 level support is required 11 3 2 Getting Support From Bright Computing If the Bright Cluster Manager software was purchased directly from Bright Computing then support bright comput ing com can be contacted for all levels of support In the bug report it is helpful to include as many details as possible to ensure the development team is able to reproduce the bug The policy at Bright Computing is to welcome such reports to provide feedback to the reporter and to work towards resolving bugs Bright Computing Inc 11 3 Getting Help With Bugs And Other Issues 209 Bright Computing provides the cm diagnose and the request remote assistance utilities to help resolve problems Reporting Cluster Manager Diagnostics With cm diagnose A diagnostic utility to help resolve bugs is cm diagnose To view its op tions it can be run as cm diagnose help If it is run without any options it runs interactively and allows the administrator to send the resultant diagnostics file to Bright Computing directly The output of a cm diagnose session looks something like this root bright51 cm diagnose Collecting kernel version Collecting top 5 processes Collecting cmsh commands Collecting network setup Collect
58. acknowledged status removed when the Unacknowledge button is clicked e Report to cluster vendor The report option is used for send ing an e mail about the selected event to the cluster vendor in case troubleshooting and support is needed The event viewer toolbar Figure 10 29 offers icons to handle event logs e detach event viewer Detaches the event viewer pane into its own window Reattachment is done by clicking on the reattachment event viewer icon that becomes available in the detached window e new event viewer filter dialog Loads or defines filters Fig ure 10 30 Filters can be customized according to acknowledge ment status time periods cluster nodes or message text The filter settings can be saved for later reloading e set event viewer filter dialog Adjusts an existing filter with a similar dialog to the new event viewer filter dialog Bright Computing Inc 192 Cluster Monitoring Title Show Acknowledged When n 05 Nov 2010 09 44 BA O8 Nov 2010 08 44 BA e 3 days z Cluster Ali Clusters X Node All Nodes hd Message Filter Load Save Ok Close Figure 10 30 cmgui Monitoring Event Viewer Filter Dialog e acknowledge event Sets the status of one or more selected events in the log to acknowledged They are then no longer seen unless the filter setting for the show acknowledged checkbox is checked in the set event filter option 10 7 Monitoring Modes With Cmsh This
59. again as instructed by the bootable hard drive This can be a useful fallback option that works around certain BIOS features or problems Besides network boot a node can also be configured to boot entirely from its drive When nodes boot from the network in simple clusters the head node supplies them with a known good state during node start up The known good state is maintained by the administrator and is defined using a soft ware image that is kept in a directory of the filesystem on the head node Supplementary filesystems such as home are served via NFS from the head node by default For a diskless node the known good state is copied over from the head node after which the node becomes available to cluster users For a disked node by default the hard disk contents on specified lo cal directories of the node are checked against the known good state on the head node Content that differs on the node is changed to that of the known good state After the changes are done the node becomes avail able to cluster users The process of getting the software image onto the nodes and getting the nodes into a good state is called node provisioning and ensures that a node is always restored to a known good state before cluster users use it The details of node provisioning are described in this chapter 6 1 Provisioning Nodes In simple clusters node provisioning is done only by the head node More complex clusters can have several provisi
60. ager reseller A certificate will be faxed or sent back in response This certificate can then be handled further as described in option 2 Bright Computing Inc 4 1 Installing a License 57 Example root mycluster request license Product Key XXXXXX XXXXXX XXXXXX XXXXXX XXXXXX 000354 515786 112224 207440 186713 Country Name 2 letter code US State or Province Name full name California Locality Name e g city San Jose Organization Name e g company Bright Computing Inc Organizational Unit Name e g department Development Cluster Name My Cluster Private key data saved to cm local apps cmd etc cert key new MAC Address of primary head node bright51 for ethO 00 0C 29 87 B8 B3 Will this cluster use a high availability setup with 2 head nodes y N n Certificate request data saved to cm local apps cmd etc cert csr new Submit certificate request to http support brightcomputing com licensing Y n y Contacting http support brightcomputing com licensing License granted License data was saved to cm local apps cmd etc cert pem new Install license Y n n Use install license cm local apps cmd etc cert pem new to install the license Installing A License Referring to the example above If the prompt Install license was answered with a Y the default the install license script is run If the prompt was answered with a n then the install license script
all nodes that have direct access to the GPUs. In most cases this means that the cuda32-driver and cuda32-libs packages should be installed in a software image. If the head nodes also contain GPUs, the cuda32-driver and cuda32-libs packages should also be installed on the head nodes. Note that, as a result of package dependencies, the cuda32-libs package is also installed on the head node; the reason for this is that the files are needed for compilation. Installing the cuda32-driver and cuda32-libs packages to a software image also causes several X11-related packages to be installed.

Example

On a cluster where some of the nodes contain GPUs but the head node does not contain a GPU, the following commands are issued on the head node to install the packages through YUM:

yum install cuda32-toolkit cuda32-sdk cuda32-profiler
yum --installroot=/cm/images/default-image install cuda32-driver cuda32-libs

The cuda32-driver package provides an init script which is executed at boot time to load the CUDA driver. Because the CUDA driver depends on the running kernel, the script compiles the CUDA driver on the fly and subsequently loads the module into the running kernel. The CUDA driver can also be loaded on the fly by calling the init script. Loading the driver also causes a number of diagnostic kernel messages to be logged.

Example

[root@mycluster ~]# /etc/init.d/cuda32-driver
62. and clicking on the arrow keys Clicking on the Recreate Initrd button runs the createramdisk command 6 3 Node Installer After the kernel has started up and the ramdisk kernel modules are in place on the node the node launches the node installer The node installer interacts with CMDaemon on the head node and takes care of the rest of the boot process Once the node installer has completed its tasks the local drive of the node has a complete Linux sys tem The node installer ends by calling sbin init from the local drive and the boot process then proceeds as a normal Linux boot The steps the node installer goes through for each node are 1 requesting a node certificate deciding or selecting node configuration starting up all network interfaces determining install mode type and execution mode running initialize scripts checking partitions mounting filesystems synchronizing the local drive with the correct software image writing network configuration files to the local drive eo ON TD a FF Q N creating an etc fstab file on the local drive installing GRUB bootloader if configured ja m running finalize scripts m N unloading specific drivers no longer needed 13 switching the root device to the local drive and calling sbin init These 13 node installer steps and related matters are described in de tail in the corresponding sections 6 3 1 6 3 13 Bright
63. arch x86_64 gt lt package image slave name autoconf arch noarch gt Additional packages to be installed in the image can be specified in the package selection file The package selection file also contains entries for the packages that can be installed on the head node image master Therefore non head node packages must have the image slave attribute Bright Computing Inc 10 Cluster Monitoring The Bright Cluster Manager monitoring framework lets a cluster admin istrator e inspect monitoring data to the required level for existing resources e configure gathering of monitoring data for new resources e see current and past problems or abnormal behavior e notice trends that help the administrator predict likely future prob lems e handle current and likely future problems by triggering alerts taking action if necessary to try to improve the situation or to investigate further Powerful features are accessible within an intuitive monitoring frame work and customized complex setups can be constructed to suit the re quirements of the administrator In this chapter the monitoring framework is explained with the fol lowing approach 1 A basic example is first presented in which processes are run on anode These processes are monitored and are acted on when a threshold is exceeded 2 With this easy to understand example as the base the various fea tures and associated functionality of
64. are e The initialize script section 6 3 5 This may run several stages before the finalize script e The imageupdate_initialize and imageupdate_finalize scripts which may run when the imageupdate command runs section 6 5 2 6 3 12 Unloading Specific Drivers Many kernel drivers are only required during the installation of the node After installation they are not needed and can degrade node performance The IPMI drivers are an egregious example of this The IPMI drivers are required to have the node installer configure the IP address of any Bright Computing Inc 6 4 Node States 107 IPMI cards Once the node is configured these drivers are no longer needed but they continue to consume significant CPU cycles and power if they stay loaded To solve this the node installer can be configured to unload a spec ified set of drivers just before it hands over control to the local init process This is done by editing the removeModulesBeforeInit setting in the node installer configuration file cm node installer scripts node installer conf By default the IPMI drivers are placed in the removeModulesBeforeInit setting 6 3 13 Switching To The Local init Process At this point the node installer is done The node s local drive now con tains a complete Linux installation and is ready to be started The node installer hands over control to the local sbin init process which con tinues the boot process and starts all runl
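Editing the removeModulesBeforeInit setting described above can be scripted. The following sketch operates on a scratch copy; the key = value format shown is an assumption, the module name mydriver is hypothetical, and the real file lives at /cm/node-installer/scripts/node-installer.conf:

```shell
#!/bin/sh
# Append a driver to removeModulesBeforeInit in a scratch copy of
# the node-installer configuration. A key = value format is assumed.
CONF=/tmp/node-installer.conf
cat > "$CONF" <<'EOF'
removeModulesBeforeInit = ipmi_si ipmi_devintf ipmi_msghandler
EOF

# Add the (hypothetical) mydriver module to the list, idempotently.
grep -q 'mydriver' "$CONF" || \
    sed -i 's/^removeModulesBeforeInit = .*/& mydriver/' "$CONF"

cat "$CONF"
```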
based interactive session with node
qrstat    show status of advance reservations
qrsub     submit advance reservation
qselect   select queues based on argument values
qsh       start sh-based interactive session with a node
qstat     show status of batch jobs and queues
qsub      submit new jobs (see related: qalter, qresub)
qtcsh     start csh-based interactive session with a node

Torque
The following commands are used to manage Torque.

Torque resource manager commands:
qalter    alter batch job
qdel      delete batch job
qhold     hold batch jobs
qrls      release hold on batch jobs
qstat     show status of batch jobs
qsub      submit job
qmgr      batch policies and configurations manager
qenable   enable input to a destination
qdisable  disable input to a destination
tracejob  trace job actions and states

Further information on these and other commands is available in the appropriate man pages and in the online documentation at http://www.adaptivecomputing.com/resources/docs/

The Torque administrator manual is online at http://www.adaptivecomputing.com/resources/docs/torque/index.php

G.3 PBS Pro
The following commands can be used in PBS Pro to view queues:

qstat              query queue status
qstat -a           alternate form
qstat -r           show only running jobs
qstat -q           show available queues
qstat -rn          show only running jobs, with the list of allocated nodes
qstat -i           show only idle jobs
qstat -u username  show only jobs for the named user
66. be invoked with metricconf CPUUser gt consolidators exit exit exit myheadnode gt monitoring gt setup MasterNode healthconf healthconf Bright Computing Inc 10 7 Monitoring Modes With Cmsh 205 Alternatively the healthconf submode with the masternode device category could also have been reached from cmsh s top level prompt by executing monitoring setup healthconf masternode The health checks set to do sampling in the device category masternode are listed Example myheadnode gt monit oring gt setup MasterNode gt healthconf list HealthCheck HealthCheck Param Check Interval DeviceIsUp 120 ManagedServicesOk 120 cmsh 1800 exports 1800 failedprejob 900 failover 1800 ldap 1800 mounts 1800 mysql 1800 The use command would normally be used to drop into the health check object However use can also be an alternative to the list com mand since tab completion suggestions to the use command will get a list of currently configured health checks for the masternode too The add command adds a health check into the device category The list of all possible health checks that can be added to the category can be seen with the command monitoring healthchecks list or more con veniently simply with tab completion suggestions to the add command At the end of section 10 7 2 a script called cpucheck was built This script was part of a task to use health checks instead of metric threshold actions t
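A standalone health check like the cpucheck script mentioned above can be sketched as a small shell script. The PASS/FAIL-on-stdout convention and the load-average threshold used here are assumptions for illustration:

```shell
#!/bin/sh
# A toy health check in the spirit of the cpucheck script: PASS if
# the 1-minute load average is below a threshold, FAIL otherwise.
# Reporting PASS or FAIL on stdout is assumed to be the convention.
THRESHOLD=400            # in hundredths, i.e. a load average of 4.00

# First field of /proc/loadavg, with the decimal point removed:
load=$(awk '{ gsub(/\./, "", $1); print $1 }' /proc/loadavg)

if [ "$load" -lt "$THRESHOLD" ]; then
    echo PASS
else
    echo FAIL
fi
```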
67. be removed then the 2 references to it first need to be removed and the device also first has to be brought to the closed state by using the close command Example mycluster gt device usedby apc01 Device used by the following Type Name Parameter Device apc01 Device is up Device node001 powerDistributionUnits Device node002 powerDistributionUnits mycluster gt device Working With Objects validate Whenever committing changes to an object the cluster management in frastructure checks the object to be committed for consistency If one or more consistency requirements are not met then cmsh reports the vio lations that must be resolved before the changes are committed The Bright Computing Inc 3 6 Cluster Management Shell 47 validate command allows an object to be checked for consistency with out committing local changes Example mycluster gt device use node001 mycluster gt device node001 clear category mycluster gt device node001 commit Code Field Message 1 category The category should be set mycluster gt device node001 set category slave mycluster gt device node001 validate All good mycluster gt device node001 commit mycluster gt device node001 3 6 4 Accessing Cluster Settings The management infrastructure of Bright Cluster Manager is designed to allow cluster partitioning in the future A cluster partition can be viewed as a virtual cluster i
68. be viewed for each node via cmgui or cmsh Queue submission and scheduling daemons normally run on the head node From cmgui their states are viewable by clicking on the node folder in the resources tree then on the node name item and selecting the Services tab Figure 10 5 Bright Computing Inc 140 Workload Management The job execution daemons run on compute nodes Their states are viewable by clicking on the Slave Nodes folder then on the node name item and selecting the Services tab From cmsh the services states are viewable from within device mode using the services command One liners from the shell to illustrate this are output elided Example root bright51 cmsh c device services node001 status sgeexecd UP root bright51 cmsh c device services master status sge UP 8 4 Configuring And Running Individual Workload Managers Bright Cluster Manager deals with the various choices of workload man agers in as generic a way as possible This means that not all features of a particular workload manager can be controlled so that fine tuning must be done through the workload manager configuration files Workload manager configuration files that are controlled by Bright Cluster Manager should normally not be changed directly because Bright Cluster Manager will overwrite them However overwriting can be prevented by setting the directive FreezeChangesTo lt workload manager gt Config in cmd
69. by a finalize script from the NFS drive to the local hard drive and placed alongside the output of a finalize script This is useful for comparison purposes after the node is fully running Writing to the drive of the local node means that the directory being written to may need to be added to the excludelistsyncinstall and or excludelistfullinstall exclude lists to prevent it being overwritten by a known good state directory during provisioning Bright Computing Inc Quickstart Installation Guide This appendix describes a basic installation of Bright Cluster Manager on a cluster as a step by step process Following these steps allows clus ter administrators to get a cluster up and running as quickly as possible without having to read the entire administrators manual References to chapters and sections are provided where appropriate F 1 Installing Head Node 1 Boot head node from Bright Cluster Manager DVD 2 Select Install Bright Cluster Manager in the boot menu 3 Once the installation environment has been started choose Normal installation mode and click Continue 4 Accept the License Agreements for Bright Cluster Manager and the Linux distribution and click Continue 5 Click Continue on kernel modules screen 6 Review the detected hardware and go back to kernel modules screen if additional kernel modules are required Once all relevant hard ware Ethernet interfaces hard drive and DVD drive is detected
70. clear and validate all work as outlined in the introduction to working with objects section 3 6 3 More detailed usage examples of these commands within a monitoring mode are given in Cmsh Monitoring Actions section 10 7 1 Adding a metric collections script to the framework is possible from this point in cmsh too Details on how to do this are given in appendix I 10 7 4 Cmsh Monitoring Setup The cmsh monitoring setup mode corresponds to the cmgui Metric Configuration and Health Check Configuration tabs of sections 10 4 2 and 10 4 3 The monitoring setup mode of cmsh like the Metric Configuration and the Health Check Configuration tabs of cmgui is used to select a device category Properties of metrics or of health checks can then be configured for the selected device category These properties are the configuration of the sampling parameters themselves for example frequency and length of logs but also the configuration of related properties such as thresholds consolidation actions launched when a metric threshold is crossed and actions launched when a metric or health state is flapping The setup mode only functions in the context of metrics or health checks and therefore these contexts under the setup mode are called submodes On a newly installed system a list command from the monitoring setup prompt displays the following account of metrics and health checks that are in use by device categories Example root myheadno
71. daemon running on the head node Cluster management appli cations never communicate directly with cluster management daemons running on non head nodes CMDaemon is an application that is started automatically when any node boots and will continue running until the node is shut down Should CMDaemon be stopped manually for whatever reason its cluster man agement functionality will no longer be available making it hard for administrators to manage the cluster However even with the daemon stopped the cluster will remain fully usable for running computational jobs The only route of communication with the cluster management dae mon is through TCP port 8081 The cluster management daemon ac cepts only SSL connections thereby ensuring all communications are en crypted Authentication is also handled in the SSL layer using client side X509v3 certificates see section 3 3 On the head node the cluster management daemon uses a MySQL database server to store all of its internal data Monitoring data is also stored in a MySQL database 3 7 1 Controlling The Cluster Management Daemon It may be useful to shut down or restart the cluster management daemon For instance a restart may be necessary to activate changes when the clus ter management daemon configuration file is modified The cluster man agement daemon operation can be controlled through the following init script arguments in etc init d cmd Init Script Operation Descrip
72. datagrams discarded due to errors in their IP headers ipInReceives Total number of input datagrams including ones with errors received from all interfaces ipInUnknownProtos Number of received datagrams but discarded because of an unknown or unsupported pro tocol ipOutDiscards Number of output IP datagrams discarded ipOutNoRoutes Number of IP datagrams discarded because no route could be found ipOutRequests Total number of IP datagrams supplied to IP in requests for transmission ipReasm0Ks Number of IP datagrams successfully re assembled ipReasmReqds Number of IP fragments received needing re assembly responsiveness sample_responsiveness sdt sample_sdt tcpCurrEstab Number of TCP connections for which the current state is either ESTABLISHED or CLOSE WAIT tcpInErrs Total number of IP segments received in error tcpRetransSegs Total number of IP segments retransmitted testcollection testmetriccollection testmetric testmetric udpInDatagrams Total number of UDP datagrams delivered to UDP users udpInErrors Number of received UDP datagrams that could not be delivered for other reasons no port excl udpNoPorts Total number of received UDP datagrams for which there was no application at the desti nation port util_sda Percentage of CPU time during which I O re quests were standalone scripts not built ins Located in directory cm local apps cmd scripts metrics Bright Computing Inc 294 Metri
default libdaploucm.so.2 dapl.2.0 "mlx4_0 2" ""

If no InfiniBand is present, the ifconfig command can be used to check which network interface is used. If the eth0 device is used, then the following line is needed in the dat.conf file:

ofa-v2-iwarp u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "eth0 0" ""

After installing the necessary packages and modifying the dat.conf file of the software images, the nodes need to be updated. This can be done with an updateprovisioners command (if there are node provisioners in the cluster), followed by an imageupdate command.

12.4.2 Preparing Input Files
Three runs of the Intel Cluster Checker are necessary for Intel Cluster Ready certification:

• As a privileged cluster user (i.e. root), to generate files for the package test
• As a regular cluster user
• As a privileged cluster user (i.e. root)

Each user requires an input file, which is called a recipe. Two other input files are important for certification: the node list and the file exclude list. These files are located in the /home/cmsupport/intel-cluster-ready directory:

• recipe-user-ib.xml
• recipe-user-nonib.xml
• recipe-root-ib.xml
• recipe-root-nonib.xml
• nodelist
• excludefiles

Recipes
The recipe-user-ib.xml, recipe-user-nonib.xml, recipe-root-ib.xml, and recipe-root-nonib.xml files are default recipes that have been included as part of the cm-config-inte
default when first connecting to the cluster with cmgui: a node, a GPU unit, and so on. Neighboring tabs often allow a closer look at issues noticed in the Overview, and also sometimes a way to act on them.

For example, if jobs are not seen in the Overview tab, then the administrator may want to look at the neighboring Services tab (Figure 10.5) and see if the workload manager is running. The Services tab allows the administrator to Start, Stop, Restart, or Reload the service with the corresponding button, if the backend init.d script for the service supports these commands. The Reset button is used to clear a "Failed" state of a service as seen by the monitoring system (the monitoring system sets the state of a service to the failed state if 20 restarts of the service in a row fail).

Figure 10.5: cmgui Device Services Tab

10.3 Monitoring Visualization With cmgui
The Monitoring option in the menu bar of cmgui (item 1 in Figure 10.4) launches an intuitive visualization tool that should be the main tool for getting a feel of the system's behavior over periods of time. With this tool the measurements and states of the system are viewed. Graphs for metrics and health checks can be looked at in various ways: for example, the graphs can be zoomed in and out on over a particular time period, the graphs can be laid out on
device: pexec -n node001,node002 "cd; ls"

• In cmgui it is executed from the Parallel Shell tab, after selecting the cluster from the resource tree (Figure 11.1).

Figure 11.1: Executing Parallel Shell Commands

11.2 Disallowing User Logins To Nodes
Users run their computations on the cluster by submitting jobs to the workload management system. However, workload management is only effective if cluster users do not run jobs outside the workload management system. This can be enforced as a policy by disabling user logins to the nodes from a head node, by adding a line like the following to the /etc/ssh/sshd_config file in the node image:

Example

AllowUsers root@bright51.cm.cluster root@node*.cm.cluster

In the example, the domain cm.cluster or
displaying the hostname, ethernetswitch, and ip properties for each object.

Example

[mycluster]% device list -f hostname:14,ethernetswitch:15,ip
hostname (key)  ethernetswitch   ip
--------------  ---------------  ---------------
apc01                            10.142.254.1
mycluster       switch01:46      10.142.255.254
node001         switch01:47      10.142.0.1
node002         switch01:45      10.142.0.2
switch01                         10.142.253.1
[mycluster->device]%

Without an argument, the default format string for the mode is used. To display the default format string, the format command without parameters is used. Invoking the format command without arguments also displays all available properties, including a description. To change the default format string, the desired format string can be passed as an argument to format.

Working With Objects: append, removefrom
When dealing with a property of an object that can take more than one value at a time (a list of values), the append and removefrom commands can be used to append to, and remove elements from, the list respectively. However, the set command may also be used to assign a new list at once. In the following example, values are appended to and removed from the powerdistributionunits property of device node001. The powerdistributionunits property represents the list of ports on power distribution units that a particular device is connected to. This information is relevant when power operations are performed on a node.
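A sketch of such a session follows. The prompts, the PDU port values (apc01:6, apc01:7), and the final output are illustrative assumptions rather than a transcript from a real cluster:

```
[mycluster]% device use node001
[mycluster->device[node001]]% append powerdistributionunits apc01:6 apc01:7
[mycluster->device*[node001*]]% removefrom powerdistributionunits apc01:7
[mycluster->device*[node001*]]% commit
[mycluster->device[node001]]% get powerdistributionunits
apc01:6
```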
domain: what the cluster uses as its domain
• NTP time servers: used to synchronize the time on the cluster with standard time

Changing external IP parameters of a cluster therefore requires making changes in the settings of the two objects above. This can be done in cmgui by using the associated settings tabs, as specified above. Changing the external network parameters of a cluster can also be done as follows, using cmsh:

Example

[mc]% network use externalnet
[mc->network[externalnet]]% set baseaddress 192.168.1.0
[mc->network[externalnet]]% set netmaskbits 24
[mc->network[externalnet]]% set gateway 192.168.1.1
[mc->network[externalnet]]% commit
[mc->network[externalnet]]% device use master
[mc->device[mc]]% interfaces
[mc->device[mc]->interfaces]% use eth1
[mc->device[mc]->interfaces[eth1]]% set ip 192.168.1.176
[mc->device[mc]->interfaces[eth1]]% commit
[mc->device[mc]->interfaces[eth1]]% partition use base
[mc->partition[base]]% set nameservers 192.168.1.1
[mc->partition[base]]% set searchdomains x.com y.com
[mc->partition[base]]% append timeservers ntp.x.com
[mc->partition[base]]% commit
[mc->partition[base]]%

After changing network configurations, a reboot of the head node is necessary to activate the changes.

To make the cluster use DHCP to obtain its external network settings, the IP address and baseaddress of externalnet are
• The imageupdate_finalize script runs after an imageupdate command is run on that node, right after the node image has been updated.

These differ from the initialize (section 6.3.5) and finalize (section 6.3.11) scripts because they run on nodes that are fully up, rather than on nodes that are booting, so they are able to access a fully running system.

6.6 Adding New Nodes
6.6.1 Adding New Nodes With cmsh And cmgui Add Functions
Node objects can be added from within the device mode of cmsh by running the add command:

Example

[bright51]% device add slavenode node002
[bright51->device*[node002*]]% commit

The cmgui equivalent of this is to go within the Slave Nodes resource and, after the Overview tabbed pane for the Slave Nodes resource comes up, to click on the Add button (Figure 6.21).

Figure 6.21: Node Creation Wizard Button

When adding the node objects in cmsh and cmgui, some values (IP addresses, for example) may need to be filled in before the object validates. Adding new node objects as placeholders can also be done from cmsh or cmgui.
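For example, the IP address of a new node's provisioning interface might be filled in before committing, along the following lines. This is a sketch: the interface name, the address, and the exact submode commands are illustrative assumptions and may differ on a real system:

```
[bright51]% device add slavenode node003
[bright51->device*[node003*]]% interfaces
[bright51->device*[node003*]->interfaces*]% set eth0 ip 10.141.0.3
[bright51->device*[node003*]->interfaces*]% exit
[bright51->device*[node003*]]% commit
```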
every time a change has been applied to a software image, the updateprovisioners command in the cmsh softwareimage mode has to be executed to propagate the changes to the other provisioning nodes. Alternatively, in cmgui, the Update Provisioning Nodes button in the Provisioning Nodes tab may be pressed when the Software Images folder is selected in the resource tree.

Although it is possible to configure any service to migrate from one head node to another in the event of a failover, in a typical HA setup only the following services will be migrated:

• NFS
• Workload Management (e.g. SGE, Torque, Maui)

13.1.2 Network Interfaces
Each head node in an HA setup typically has at least an external and an internal network interface, each configured with an IP address. In addition, an HA setup involves two virtual IP interfaces which migrate in the event of a failover: the external shared IP address and the internal shared IP address. In a normal HA setup, both shared IP addresses are hosted on the head node that is operating in active mode.

When head nodes are also being used as login nodes, users outside of the cluster are encouraged to use the shared external IP address for connecting to the cluster. This ensures that they will always reach whichever head node is active. Similarly, inside the cluster, slave nodes will use the shared internal IP address wherever possible for referring to the head node. For example, slave nodes mount NFS filesystems on the
external LDAP server onto the head node, hence keeping all cluster authentication local, and making the presence of the external LDAP server unnecessary except for updates. This optimization is described in the next section.

7.3.1 External LDAP Server Replication
This section explains how to set up replication for an external LDAP server to an LDAP server that is local to the cluster, if improved LDAP services are needed. Section 7.3.2 then explains how this can be made to work with a high availability setup.

Typically, the Bright LDAP server is configured as a replica (consumer) to the external LDAP server (provider), with the consumer refreshing its local database at set timed intervals. How the configuration is done varies according to the LDAP server used. The description in this section assumes the provider and consumer both use OpenLDAP.

External LDAP Server Replication: Configuring The Provider
It is advisable to back up any configuration files before editing them.

The provider is assumed to be an external LDAP server, and not necessarily part of the Bright cluster. The LDAP TCP ports 389 and 636 may therefore need to be made accessible between the consumer and the provider by changing firewall settings.

If a provider LDAP server is already configured, then the following synchronization directives must be in the slapd.conf file to allow replication:

index e
field beside the Uplink label. More uplinks can be appended by clicking on the + widget. The state is saved with the Save button.

Figure 4.8: Notifying CMDaemon About Uplinks With cmgui

• In cmsh, the switch is accessed from the device mode. The uplink port numbers can be appended one by one with the append command, or set in one go by using space-separated numbers.

Example

[root@bright51 ~]# cmsh
[bright51]% device
[bright51->device]% set switch01 uplinks 15 16
[bright51->device*]% set switch02 uplinks 01
[bright51->device*]% commit
successfully committed 3 Devices

4.5.4 The showport MAC Address To Port Matching Tool
The showport command can be used in troubleshooting network topology issues, as well as in checking and setting up new nodes (section 6.3.2).

Basic Use Of showport
In the device mode of cmsh is the showport command, which works out which ports on which switch are associated with a specified MAC address.

Example

[root@bright51 ~]# cmsh
[bright51]% device
[bright51->device]% showport 00
flag. To exclude the kernel from the list of packages that should be updated, the following command can thus be used:

yum --exclude=kernel update

If a package (e.g. kernel) is to be excluded permanently from all YUM updates, it can be appended to the space-separated exclude list option of a repository configuration. Repository configuration files are located in the /etc/yum.repos.d directory.

An updated kernel in a node image is not used until it is explicitly enabled with either cmgui or cmsh.

To enable it in cmgui, the Software Images resource is selected, and the specific image item is selected. The Settings tabbed pane for that particular software image is opened, the new kernel version is selected from the Kernel version drop-down menu, and the Save button is clicked. Saving the version builds a new initial ramdisk.

To enable the updated kernel from cmsh, the softwareimage mode is used. The kernelversion property of a specified software image is then set.

9.5 Creating Custom Software Images
By default, the node image used to boot non-head nodes is based on the same version and release of the Linux distribution as used by the head node. However, sometimes an image based on a different distribution or a different release from that on the head node may be needed. Creating a working node image consists of two steps. The first step is to create a base distribution archive from an installed base host. The second step is to create the
found in STDIN (/cm/local/apps/cmd/scripts/actions/killprocess.pl)

Power off       Power off the device (built-in)
Power on        Power on the device (built-in)
Power reset     Power reset the device (built-in)
Reboot          Reboot the node (built-in)
SendEmail       Send an email to the address specified by the parameter in the monitoring configuration (built-in)
Shutdown        Shutdown the node (built-in)
Undrain node    Enable a node to start running jobs for the scheduler (built-in)

Figure 10.28: cmgui Monitoring Main Actions Tab

What the listed actions on a newly installed system do is described in Appendix H.3.1.

The remove, revert, and save buttons work as described for metrics in section 10.4.4. The edit and add buttons start up dialogs to edit or add options to action parameters. Action parameters are described in Appendix H.3.2.

10.5 Overview Of Monitoring Data
These views are set up under the Overview tab for various devices within a resource. They are a miscellany of monitoring views based on the monitored data for a particular device. The views are laid out as part of an overview tab for that device, which can be a switch, a cluster, a node, a GPU unit, and so on. When first connecting to a cluster with cmgui, the overview tab
Name               ibnet
Domain name        ib.cluster
External network   false
Base address       10.149.0.0
Netmask bits       16
Broadcast address  10.149.255.255

Once the network has been created, all nodes must be assigned an InfiniBand interface on this network. The easiest method of doing this is to create the interface for one node device and then to clone that device several times. For large clusters, a labor-saving way to do this is to use the addinterface command (section 4.3.1), as follows:

[root@bright51 ~]# echo "device; addinterface -n node001..node150 physical ib0 ibnet 10.149.0.1; commit" | cmsh -x

When the head node is also equipped with an InfiniBand HCA, it is important that a corresponding interface is added and configured in the cluster management infrastructure.

Example

Assigning an IP address on the InfiniBand network to the head node:

[mc->device[mc]->interfaces]% add physical ib0
[mc->device[mc]->interfaces*[ib0*]]% set network ibnet
[mc->device[mc]->interfaces*[ib0*]]% set ip 10.149.255.254
[mc->device[mc]->interfaces*[ib0*]]% commit

As with any change to the network setup, the head node needs to be restarted to make the above change active.

4.4.4 Verifying Connectivity
After all nodes have been restarted, the easiest way to verify connectivity is to use the ping utility.

Example

Pinging node015 while logged in to node014 through the InfiniBand network:
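Since the ibnet network was created with the domain name ib.cluster, the InfiniBand interfaces can be addressed by their fully qualified names. A sketch of such a connectivity check follows; the hostnames are taken from the example network above, and the command output is omitted since it depends on the cluster:

```
[user@node014 ~]$ ping -c 3 node015.ib.cluster
```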
indicate that handing over control to the local init process failed, or that the local init process was not able to start CMDaemon on the node. Lastly, this state can be entered when the previous state was INSTALLER_REBOOTING and the reboot takes too long.

• INSTALLER_UNREACHABLE: This state is entered from the INSTALLING state when the head node CMDaemon can no longer ping the node. It could indicate that the node has crashed while running the node-installer.

• INSTALLER_REBOOTING: In some cases the node-installer has to reboot the node to load the correct kernel. Before rebooting, it sets this state. If the subsequent reboot takes too long, the head node CMDaemon sets the state to INSTALLER_FAILED.

6.5 Updating Running Nodes
6.5.1 Updating Running Nodes: excludelistupdate
Changes made to the contents of the head node's software image for nodes become part of the provisioning system, according to its housekeeping system (section 6.1.4). The image is then installed from the provisioning system onto a regular node when it (the regular node) reboots, via a provisioning request (section 6.3.7).

However, updating a running node with the latest changes from the software image is also possible without rebooting it. Such an update can be requested using cmsh or cmgui, and is queued and delegated to a provisioning node, just like an ordinary provisioning request. Like the provisioning requ
is to run several instances of the standard UNIX utility yes. The yes command sends out an endless number of lines of "y" text. It is usually used to answer prompts for confirmation.

8 subshell processes are run in the background from the command line on the head node, with yes output sent to /dev/null, as follows:

for i in {1..8}; do yes > /dev/null & done

Running mpstat 2 shows usage statistics for each processor, updating every 2 seconds. It shows that %user, which is user mode CPU usage, and which is reported as CPUUser in the Bright Cluster Manager metrics, is close to 100% on an 8-core or less head node when the 8 subshell processes are running.

Setting Up The Kill Action
To stop the pointless CPU-intensive yes processes, the command killall yes is used. It is made a part of a script, killallyes:

#!/bin/bash
killall yes

and made executable with a chmod 700 killallyes. For convenience, it may be placed in the /cm/local/apps/cmd/scripts/actions directory, where other action scripts also reside.

10.1.2 Using The Framework
Now that the pieces are in place, cmgui's monitoring framework is used to add the action to its action list, and then to set up a threshold level that triggers the action.
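Before wiring killallyes into the monitoring framework, the load-generation and kill steps above can be exercised by hand in one self-contained run. The pgrep -x check and the pkill fallback are additions for illustration, not part of the original example:

```shell
#!/bin/bash
# Start 8 background 'yes' processes, as in the load-generation example.
for i in {1..8}; do yes > /dev/null & done
sleep 1
# Stop them all, as the killallyes action script does.
# (pkill -x is used as a fallback in case killall is not installed.)
killall yes 2> /dev/null || pkill -x yes
sleep 1
# Confirm that none survive.
pgrep -x yes > /dev/null || echo "no yes processes remain"
```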
like:

cmsh -c "device use node001; set installmode FULL; commit"

By default, the install mode property is auto-linked to the property set for install mode for that category of node. Since the property for that node's category defaults to AUTO, the property for the install mode of the node configuration defaults to AUTO.

5. The install mode property of the node's category. This can be set using cmgui (Figure 6.13), or using cmsh with a one-liner like:

cmsh -c "category use slave; set installmode FULL; commit"

As already mentioned in a previous point, the install mode is set by default to AUTO.

6. A dialog on the console of the node (Figure 6.16) gives the user a last opportunity to overrule the install mode value as determined by the node-installer. By default it is set to AUTO.

Figure 6.16: Install Mode Setting Option During Node-Installer Run

6.3.5 Running Initialize Scripts
An initialize script is used when custom commands need to be executed before checking partitions and mounting devices, for example to initialize some unsupported hardware, or to do a RAID configuration lookup for a particular node. In such cases the custom commands are added to an initialize script. An initialize script can be added to both a node's category and the node configuration. The node-installer first runs an initialize script, if it exists, from
made, and the modified command confirms that no local changes exist. Finally, the get command reconfirms that no local changes exist.

Some properties are booleans. For these, the values "yes", "1", "on" and "true" are equivalent to each other, as are their opposites "no", "0", "off" and "false". These values are case-insensitive.

Working With Objects: clear

Example

[mycluster]% device set node101 mac 00:11:22:33:44:55
[mycluster]% device get node101 mac
00:11:22:33:44:55
[mycluster]% device clear node101 mac
[mycluster]% device get node101 mac
00:00:00:00:00:00
[mycluster]% device

The get and set commands are used to view and set the MAC address of node101, without running the use command to make node101 the current object. The clear command then unsets the value of the property. The result of clear depends on the type of the property it acts on. In the case of string properties the empty string is assigned, whereas for MAC addresses the special value 00:00:00:00:00:00 is assigned.

Working With Objects: list, format
The list command is used to list all device objects. The -f flag takes a format string as argument. The string specifies what properties are printed for each object, and how many characters are used to display each property in the output line. In the following example, a list of objects is requested
monitoring, along with definitions of terms used, is appropriate at this point. The features of the monitoring framework covered later on in this chapter will then be understood more clearly.

10.2.1 Metric
In the basic example of section 10.1, the metric value considered was CPUUser, measured at regular time intervals of 120s.

A metric is a property of a device that can be monitored. It has a numeric value and can have units, unless it is unknown, i.e. has a null value. Examples are:

• temperature (value in degrees Celsius, for example: 45.2°C)
• load average (value is a number, for example: 1.23)
• free space (value in bytes, for example: 12322343)

A metric can be a built-in, which means it is an integral part of the monitoring framework, or it can be a standalone script. The word metric is often used to mean the script or object associated with a metric, as well as a metric value. The context makes it clear which is meant.

10.2.2 Action
In the basic example of section 10.1, the action script is the script added to the monitoring system to kill all yes processes. The script runs when the condition is met that CPUUser crosses 50%.

An action is a standalone script or a built-in command that is executed when a condition is met. This condition can be:

• health checking (section 10.2.4)
• threshold checking (section 10.2.3) associated with a metric (section 10.2.1)
• state flapping (section 10.2.9)

10.2.3 Threshold
In the
more information on obtaining a new license.

Generated Files
Section 3.7.3 describes how system configuration files on all nodes are written out. This appendix contains a list of all system configuration files which are generated automatically. All of these files may be listed as Frozen Files in the Cluster Management Daemon configuration file to prevent them from being generated automatically (see section 3.7.3 and Appendix C).

Files generated automatically on head nodes:

File                                    Generated By   Method        Comment
/etc/resolv.conf                        CMDaemon       Entire file
/etc/localtime                          CMDaemon       Entire file
/etc/exports                            CMDaemon       Section
/etc/fstab                              CMDaemon       Section
/etc/hosts                              CMDaemon       Section
/etc/hosts.allow                        CMDaemon       Section
/tftpboot/mtu.conf                      CMDaemon       Entire file
/etc/sysconfig/ipmicfg                  CMDaemon       Entire file
/etc/sysconfig/network/config           CMDaemon       Section       SuSE only
/etc/sysconfig/network/routes           CMDaemon       Entire file   SuSE only
/etc/sysconfig/network/ifcfg-*          CMDaemon       Entire file   SuSE only
/etc/sysconfig/network/dhcp             CMDaemon       Section       SuSE only
/etc/sysconfig/network                  CMDaemon       Entire file   RedHat only
/etc/sysconfig/network-scripts/ifcfg-*  CMDaemon       Entire file   RedHat only
/etc/dhclient.conf                      CMDaemon       Entire file   RedHat only
/etc/dhcpd.conf                                        Entire file
/etc/dhcpd.slavenet.conf                               Entire file
/etc/shorewall/interfaces               CMDaemon       Section
/etc/shorewall/masq                     CMDaemon       Section
/etc/shorewall
networking types are TCP/IP over Ethernet and InfiniBand.

12.6.2 Server Implementation
Lustre servers (MDS and OSSs) run on a patched kernel. The patched kernel, kernel modules, and software can be installed with RPM packages. The Lustre server software can also be compiled from source, but the kernel needs to be patched and recreated. Lustre supports one kernel version per Lustre version.

To use Lustre with Bright Cluster Manager, a Lustre server image and a Lustre client image are installed onto the head node, so that they can provision the Lustre nodes.

Creating The Lustre Server Image
To create a Lustre server image, a clone is made of an existing software image, for example from default-image. In cmgui this is done by selecting the Software Images resource to bring up the Overview tabbed pane display. Selecting the image to clone and then clicking on the Clone button prompts for a confirmation to build a clone image (Figure 12.1).

Figure 12.1: cmgui Cloning An Image

Alternatively, cmsh on the head node can create a clone image:

Example

[root@my
new cluster, and enter the following parameters:

Host: Hostname or IP address of the cluster
Certificate: Click Browse and browse to the certificate file
Password: Password entered during installation

Click on the Connect button (see Figure 3.3).

(Optional) For more information on how the Cluster Management GUI can be used to manage one or more clusters, consult section 3.4.

Your cluster should now be ready for running compute jobs. For more information on managing the cluster, please consult the appropriate chapters in this manual.

Please consult the User Manual, provided in /cm/shared/docs/cm/user-manual.pdf, for more information on the user environment and how to start jobs through the workload management system.

Workload Managers Quick Reference

G.1 Sun Grid Engine
Sun Grid Engine (SGE) is a workload management system that was originally made available under an Open Source license by Sun Microsystems. It forked off into various versions in 2010, and its future is unclear at the time of writing. Bright Cluster Manager 5.1 uses SGE version 6.2 update 5, which was the last release from Sun Microsystems, and remains in widespread use.

SGE services should be handled using CMDaemon, as explained in section 8.3. However, SGE can break in obtuse ways when implementing changes, so the following notes are sometimes useful in getting a system going again.

• The sge_qmaster daemon on the head node can be st
of node001, node002, and so on. On the Bright head server, after chrooting to the <image> directory with:

chroot /cm/images/<image>

the kadmin shell is entered. For each regular node in the image, the following keytab command is run:

ktadd host/<nodenumber>.cm.cluster

7.4.3 Configuring PAM
The system-auth service is configured in /etc/pam.d/system-auth, with the following rules added:

auth      sufficient  pam_krb5.so use_first_pass
account   [default=bad success=ok user_unknown=ignore] pam_krb5.so
password  sufficient  pam_krb5.so use_authtok
session   optional    pam_krb5.so

Similar entries will exist for LDAP authentication, which, if left in there, will allow users to authenticate either against LDAP or against Kerberos. LDAP authentication can be disabled by removing the lines including pam_ldap.so, thereby allowing users to authenticate only with Kerberos.

7.5 Tokens And Profiles
Tokens are used to assign capabilities to users, who are grouped according to their assigned capabilities. A profile is the name given to each such group. A profile thus consists of a set of tokens. The profile is stored as part of the authentication certificate generated for running authentication operations to the cluster manager for the certificate owner. Authentication is introduced earlier, in section 3.3.

The certificate can be generated within cmsh by using the createcertificate
of the cluster is the default view. The overview tab is also the default view the first time a device within a resource is clicked on in a cmgui session.

Of the devices, the cluster, head node, and regular node resources have a relatively extensive overview tab, with a pre-selected mix of information from monitored data. For example, in Figure 10.4 a head node is shown with an overview tab presenting memory used, CPU usage, disk usage, network statistics, running processes, and health status. Some of these values are presented with colors and histograms to make the information easier to see.

10.6 Event Viewer
This is a log view of events on the cluster(s). The logs can be handled and viewed in several ways.

Figure 10.29: cmgui Monitoring Event Viewer Pane

Double-clicking on an event row starts up an Event Details dialog (Figure 10.29), with buttons to:

• Acknowledge or Unacknowledge the event, as appropriate. Clicking on Acknowledge will remove the event from the event view, unless the Show Acknowledged checkbox has been checked. Any visible acknowledged events will have their
only)
• CentOS 5 (x86_64 only)
• SuSE Enterprise Server 11 (x86_64 only)

This chapter introduces some basic features of Bright Cluster Manager and describes a basic cluster in terms of its hardware.

1.2 Cluster Structure
In its most basic form, a cluster running Bright Cluster Manager contains:

• One machine designated as the head node
• Several machines designated as compute nodes
• One or more (possibly managed) Ethernet switches
• One or more power distribution units (optional)

The head node is the most important machine within a cluster because it controls all other devices, such as compute nodes, switches, and power distribution units. Furthermore, the head node is also the host that all users (including the administrator) log in to. The head node is the only machine that is connected directly to the external network, and is usually the only machine in a cluster that is equipped with a monitor and keyboard. The head node provides several vital services to the rest of the cluster, such as central data storage, workload management, user management, and DNS and DHCP services. The head node in a cluster is also frequently referred to as the master node.

A cluster typically contains a considerable number of non-head, or regular, nodes, often also referred to as slave nodes. Most of these nodes are compute nodes. Compute nodes are the machines that will do the heavy work when a cluster is being used for large
or rather, when the cluster management daemon is starting, the Init dead time is used, rather than the Dead time, to determine whether the other node is alive.

Mount script
The script pointed to by the Mount script setting is responsible for bringing up and mounting the shared filesystems.

Unmount script
The script pointed to by the Unmount script setting is responsible for bringing down and unmounting the shared filesystems.

Quorum time
When a node is asked which head nodes it is able to reach over the network, the node has a certain time within which it must respond. If a node does not respond to a quorum within the configured Quorum time, it is no longer considered for the results of the quorum.

Secondary master
The Secondary master setting is used to define the secondary head node to the cluster.

13.3.5 Re-cloning A Head Node
After an HA setup has gone into production, it may become necessary to re-install one of the head nodes at some point. This would be necessary if one of the head nodes were replaced due to hardware failure. To re-clone a head node out of an existing active head node, enter cmha-setup, select Failover Status, and subsequently View clone installation instructions. Then follow the instructions as displayed on the screen, i.e. repeat the instructions in section 13.2.2.

Note that if the MAC address of one of the head nodes has changed, it is typically necessary to request a new license. See section 4.1.3 for
parameters for the virtual shared external IP address. By selecting Create, the shared external interface is created.

6. Configure the hostname and the internal and external primary network interfaces for the secondary head node.

The primary head node may have other network interfaces (e.g. InfiniBand interfaces, an IPMI interface, an alias interface on the IPMI network). These interfaces are also created on the secondary head node, but the IP addresses of these interfaces still need to be configured. For each such interface, when prompted, configure a unique IP address for the secondary head node.

7. Configure the dedicated failover network that will be used between the two head nodes for heartbeat monitoring.

8. Assign a network interface and IP address on both head nodes that will be used for the dedicated failover network.

13.2.2 Cloning
After the parameters have been configured in the Preparation stage, the secondary head node should be cloned from the primary head node. This procedure may also be repeated later on, if a head node ever needs to be replaced (e.g. as a result of defective hardware).

• Boot the secondary head node off the internal cluster network. It is highly recommended that the primary and secondary head nodes have identical hardware configurations.

• In the Cluster Manager PXE Environment menu, before the time-out of 5s expires, select
provisioning tasks are allowed to run on the entire cluster. This is set using the MaxNumberOfProvisioningThreads directive in the head node's CMDaemon configuration file, /etc/cmd.conf, as described in Appendix C.

A provisioning request is deferred if the head node is not able to immediately allocate a provisioning node for the task. Whenever an ongoing provisioning task has finished, the head node tries to re-allocate deferred requests.

Provisioning Role Change Notification With updateprovisioners
Whenever updateprovisioners is invoked, the provisioning system waits for all running provisioning tasks to end, and then updates all images located on any provisioning nodes by using the images on the head node. It also re-initializes its internal state with the updated provisioning role properties, i.e. it keeps track of what nodes are provisioning nodes.

The updateprovisioners command can be accessed from the softwareimage mode in cmsh. It can also be accessed from cmgui (Figure 6.3).

[Figure 6.3: Software image provisioning status in cmgui. The status pane shows: provisioning subsystem status: idle; update of provisioning nodes requested: no; maximum number of nodes provisioning: 10000; nodes currently provisioning: 0; nodes waiting to b…]
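From a shell on the head node, the invocation might look like the following sketch, using the non-interactive cmsh -c form seen elsewhere in this manual (any confirmation message printed by the command is omitted here, since it is version-dependent):

```shell
# Sketch: request an image update on all provisioning nodes.
# Assumes this is run on the head node with administrator rights.
cmsh -c "softwareimage; updateprovisioners"
```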
queues associated with a scheduler.

Example

[root@bright51 ~]# cmsh
[bright51]% jobqueue
[bright51->jobqueue]% list
Type      Name
--------- ------------------------
sge       all.q
torque    default
torque    hydroq
torque    longq
torque    shortq

• qstat lists statistics for the queues associated with a scheduler.

Example

[bright51->jobqueue]% qstat
======================= sge =======================
Queue            Load     Total    Used     Available
all.q            0.1      1        0        1
====================== torque =====================
Queue            Running  Queued   Held     Waiting
default          0        0        0        0
hydroq           1        0        0        0
longq            0        0        0        0
shortq           0        0        0        0
====================== pbspro =====================
Queue            Running  Queued   Held     Waiting

• listpes lists the parallel environments available for schedulers.

Example (some details elided)

[bright51->jobqueue]% listpes
Scheduler       Parallel Environment
--------------- --------------------
sge             make
sge             mpich
...
sge             openmpi_ib

• scheduler sets the scheduler submode.

Example

[bright51->jobqueue]% scheduler torque
Working scheduler is torque
[bright51->jobqueue(torque)]%

The submode can be unset using scheduler.

jobqueue Mode In cmsh: The scheduler Submode
If a scheduler submode is set, then commands under jobqueue mode operate only on the queues for that particular scheduler. For example, within the torque submode of jobqueue mode, the list command shows only the queues for torque.

Example

[bright51->jo
set to 0.0.0.0. The gateway address, the nameserver(s), and the IP address of the external address are then obtained via DHCP.

Timeserver configuration for externalnet is not picked up from the DHCP server, having been set during installation (Figure 2.20). It can be changed manually, if so desired, using cmgui as in Figure 4.6, or using cmsh in partition mode as in the above example.

4.3 Configuring IPMI Interfaces
Bright Cluster Manager also takes care of the initialization and configuration of the baseboard management controller (BMC) that may be present on devices. The IPMI (or iLO) interface that is exposed by a BMC is treated in the cluster management infrastructure as a special type of network interface belonging to a device. In the most common setup, a dedicated network (i.e. IP subnet) is created for IPMI communication. The 10.148.0.0/16 network is used by default for IPMI interfaces by Bright Cluster Manager.

4.3.1 Network Settings
The first step in setting up IPMI is to add the IPMI network as a network object in the cluster management infrastructure. The procedure for adding a network was described in section 4.2.2. The following settings are recommended as defaults:

Property             Value
-------------------- ------------------
Name                 ipminet
Domain name          ipmi.cluster
External network     false
Base address         10.148.0.0
Netmask bits         16
Broadcast address    10.148.255.255

Once the network has b
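These recommended defaults could also be applied from the command line. A sketch using cmsh follows; the property names used here are assumed to correspond to the settings in the table above, and should be checked against the output of show in network mode before use:

```shell
# Sketch: create the IPMI network with the recommended default settings.
# Property names are assumptions based on the settings table above.
cmsh -c "network; add ipminet; set domainname ipmi.cluster; \
set baseaddress 10.148.0.0; set netmaskbits 16; \
set broadcastaddress 10.148.255.255; commit"
```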
started from the autoexec.bat file. Note that DOS text files require a carriage return at the end of every line.

Example

cp cmosprog.com /mnt/bin
echo -e "A:\\cmosprog.com\r" >> /mnt/autoexec.bat

After making the necessary changes to the DOS image, it is unmounted:

umount /mnt

After preparing the DOS image, it is booted as described in section 11.5.3.

11.5.2 Updating BIOS
Upgrading the BIOS to a new version involves using the DOS tools that were supplied with the BIOS. Similar to the instructions above, the flash tool and the BIOS image must be copied to the DOS image. The file autoexec.bat should be altered to invoke the flash utility with the correct parameters. In case of doubt, it can be useful to boot the DOS image and invoke the BIOS flash tool manually. Once the correct parameters have been determined, they can be added to autoexec.bat.

After a BIOS upgrade, the contents of the NVRAM may no longer represent a valid BIOS configuration, because different BIOS versions may store a configuration in different formats. It is therefore recommended to also write updated NVRAM settings immediately after flashing a BIOS image (see the previous section).

The next section describes how to boot the DOS image.

11.5.3 Booting DOS Image
To boot the DOS image over the network, it first needs to be copied to the software image's /boot directory, and it must be world-readable.

Example

cp flash.img /cm/images/default-image/boot/bios/flas
the Bright Cluster Manager monitoring framework are described and discussed in depth. These include visualization of data, concepts, configuration, monitoring customization, and cmsh use.

10.1 A Basic Example Of How Monitoring Works
In this section, a minimal basic example of monitoring a process is set up. The aim is to present a simple overview that covers a part of what the monitoring framework is capable of handling. The overview gives the reader a structure to keep in mind, around which further details are fitted and filled in during the coverage in the rest of this chapter.

In the example, a user runs a large number of pointless CPU-intensive processes on a head node which is normally very lightly loaded. An administrator would then want to monitor user-mode CPU load usage, and stop such processes automatically when a high load is detected (Figure 10.1).

[Figure 10.1: Monitoring Basic Example: CPU-intensive Processes Started, Detected And Stopped — a graph of CPU load over time: the load rises when the CPU-intensive processes are started, and drops again when the high load is detected and the processes are stopped.]

The basic example illustrates a very contrived way for the Bright Cluster Manager monitoring framework to be used to do that.

10.1.1 Before Using The Framework — Setting Up The Pieces
Running A Large Number Of Pointless CPU-Intensive Processes
One way to simulate a user running pointless CPU-intensive processes
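One simple way is to start several endless loops from a shell on the head node, for example with the yes utility (a sketch; the process count of four is an illustrative assumption, not something the framework requires):

```shell
# Start a handful of pointless CPU-intensive processes in the background.
# Each `yes` instance spins, writing "y" as fast as it can.
for i in 1 2 3 4; do
    yes > /dev/null &
done

# ...user-mode CPU load rises until the processes are stopped again:
killall yes
```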
the InfiniBand interconnect:

[root@node014 ~]# ping node015.ib.cluster
PING node015.ib.cluster (10.149.0.15) 56(84) bytes of data.
64 bytes from node015.ib.cluster (10.149.0.15): icmp_seq=1 ttl=64 time=0.086 ms

If the ping utility reports that ping replies are being received, the InfiniBand is operational. The ping utility is not intended to benchmark high-speed interconnects. For this reason, it is usually a good idea to perform more elaborate testing to verify that bandwidth and latency are within the expected range. The quickest way to stress-test the InfiniBand interconnect is to use the Intel MPI Benchmark (IMB), which is installed by default in /cm/shared/apps/imb/current. The setup.sh script in this directory can be used to create a template in a user's home directory to start a run.

Example

Running the Intel MPI Benchmark using openmpi to evaluate performance of the InfiniBand interconnect between node001 and node002:

[root@mycluster ~]# su - cmsupport
[cmsupport@mycluster ~]$ cd /cm/shared/apps/imb/current
[cmsupport@mycluster current]$ ./setup.sh
[cmsupport@mycluster current]$ cd ~/BenchMarks/imb/3.2
[cmsupport@mycluster 3.2]$ module load openmpi/gcc
[cmsupport@mycluster 3.2]$ module initadd openmpi/gcc
[cmsupport@mycluster 3.2]$ make -f make_mpi2
[cmsupport@mycluster 3.2]$ mpirun -np 2 -machinefile nodes IMB-MPI2 PingPong

# Benchmarking PingPong
# #processes = 2

       #bytes  #repetitions   t[usec]
the basic example of section 10.1, shows a Thresholds display window, with a threshold named killallyesthreshold configured for the metric CPUUser.

The Edit and Remove buttons in this display edit and remove a selected threshold from the list of thresholds, while the Add button adds a new threshold to the list.

[Figure 10.19: cmgui Metric Configuration Thresholds Edit Dialog — fields for Name, Bound, Bound type (upper/lower), Severity and Action.]

The Edit and Add dialogs for a threshold prompt for the following values (Figure 10.19):

• Name: the threshold's name.
• Bound: the metric value which demarcates the threshold.
• Bound type: if checked, the radio button for upper bound (value > bound) places the threshold zone above the bound, while lower bound (value < bound) places the threshold zone below the bound.
• Severity: a value assigned to indicate the severity of the situation if the threshold is crossed. It is 10 by default. Severity is discussed in section 10.2.6.
• Action: the action field types decide how the action should be triggered and run. The field types are, from left to right:
  – script: a script selected from a drop-down list of available actions.
  – parameter (optional): what parameter value to pass to the ac
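For instance, the killallyes action referred to above could be implemented by a one-line shell script along these lines (a sketch only; the actual action script used on a cluster may differ):

```shell
#!/bin/sh
# Hypothetical killallyes action script: stop all `yes` processes.
# `|| true` keeps the exit status clean when no matching process exists.
killall -9 yes 2>/dev/null || true
```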
the directory specified with option -b.

-d    Install distribution packages

Examples of usage are:

Example

If the base distribution is in /tmp/BASEDIST.tar.gz, then the command that creates the node image /cm/images/new-image is:

cm-create-image -x /tmp/BASEDIST.tar.gz -b /cm/images/new-image

Example

If the contents of basehost64 were rsynced to an existing directory /cm/images/new-image, then no extraction is needed, and the command to create a node image is then simply:

cm-create-image -b /cm/images/new-image

This creates an image with the name new-image in the CMDaemon database.

Example

The same image, with a different name, bio-image, can be created with the -n option:

cm-create-image -b /cm/images/new-image -n bio-image

Example

The -d flag is used to make the utility install distribution-specific packages into the image:

cm-create-image -b /cm/images/new-image -n bio-image -d

Package selection files are used from /cm/local/apps/cluster-tools/config. If the base distribution of the node image being created is CentOS5, then the config file used is:

/cm/local/apps/cluster-tools/config/CENTOS5-config-dist.xml

The package selection file is made up of a list of XML elements specifying the name of the package, the architecture and the image type. For example:

<package image="slave" name="apr" arch="x86_64"/>
<package image="slave" name="apr-util" arch="x86_64"/>
<package image="slave" name="atk-devel"
the node. As with the preceding option, files and directories that are being synchronized on the node lose their original contents.

The exclude lists are passed to rsync using its --exclude-from option. The syntax of an exclude list can be quite involved. The rsync manual page, specifically the INCLUDE/EXCLUDE PATTERN RULES section, gives details on how such a list is built.

A cmsh one-liner to get the exclude list for a category is:

cmsh -c "category use slave; get excludelistfullinstall"

Similarly, to set the list:

cmsh -c "category use slave; set excludelistfullinstall; commit"

where a text editor opens up to allow changes to be made to the list. Figure 6.17 illustrates how the setting can be modified via cmgui.

[Figure 6.17: Setting up exclude lists with cmgui]

interface used to receive image data (provisioninginterface): for nodes with
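As an illustration, such an exclude list contains one rsync pattern per line; entries along the following lines are typical (the exact default entries on a given cluster differ, and the leading "- " prefix form shown here follows rsync's filter-rule syntax):

```
- /tmp/*
- /var/tmp/*
- /proc/*
- /sys/*
```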
the properties of group printer back to their last committed state, and does not affect other objects.

Example

[mycluster->user*[printer*]]% refresh
[mycluster->user*[printer]]% show
Parameter            Value
-------------------- ------------------------
Group ID             503
Group members        maureen
Group name           printer

Here the user maureen reappears, because she was stored in the last save. Also, because only the group printer object has been committed, the asterisk indicates the existence of other uncommitted, modified objects.

7.2.5 Removing A User
This corresponds roughly to the functionality of the Remove button operation in section 7.1. The remove command removes a user or group. The useful -r flag, added to the end of the username, removes the user's home directory too. For example, within user mode, the command "remove user maureen -r; commit" removes user maureen along with her home directory. Or, continuing the session at the end of section 7.2.4 from where it was left off:

Example

[mycluster->user[printer]]% use user maureen
[mycluster->user[maureen]]% remove -r; commit
[mycluster->user]% !ls -d /home/* | grep maureen  #no maureen left behind
[mycluster->user]%

7.3 Using An External LDAP Server
When using an external LDAP server to serve the user database, a Bright cluster can be configured in different ways to authenticate against it.

For sm
the use of this document.

Limitation of Liability and Damages Pertaining to Bright Computing, Inc.
The Bright Cluster Manager product principally consists of free software that is licensed by the Linux authors free of charge. Bright Computing, Inc. shall have no liability, nor will Bright Computing, Inc. provide any warranty for the Bright Cluster Manager, to the extent that is permitted by law. Unless confirmed in writing, the Linux authors and/or third parties provide the program as is, without any warranty, either expressed or implied, including, but not limited to, marketability or suitability for a specific purpose. The user of the Bright Cluster Manager product shall accept the full risk for the quality or performance of the product. Should the product malfunction, the costs for repair, service, or correction will be borne by the user of the Bright Cluster Manager product. No copyright owner or third party who has modified or distributed the program as permitted in this license shall be held liable for damages, including general or specific damages, damages caused by side effects or consequential damages, resulting from the use of the program or the un-usability of the program (including, but not limited to, loss of data, incorrect processing of data, losses that must be borne by you or others, or the inability of the program to work together with any other program), even if a copyright owner or third party had been advised about the possib
to retrieve a MAC value for custompowerscriptargument with a bash script, as shown in the previous section, and then pass the argument via $3, as is done in the command wakeonlan $3. Instead, custompowerscript can simply call wakeonlan $CMD_MAC directly in the script, when run as a power operation command from within CMDaemon.

5.1.5 Hewlett Packard iLO-Based Power Control
If Hewlett Packard is chosen as the node manufacturer during installation, and the nodes have an iLO management interface, then Hewlett Packard's iLO management package, hponcfg, is installed by default on the nodes and head nodes.

The hponcfg package is in the Bright Cluster Manager rpm repository, so it is easily upgraded if needed for more recent hardware. The installation is done on the head node, the node image, and in the node-installer as follows:

yum install hponcfg
yum install hponcfg --installroot=/cm/images/default-image
yum install hponcfg --installroot=/cm/node-installer

To use iLO over all nodes, the following steps are done:

1. The iLO interfaces of all nodes are set up like the IPMI interfaces outlined in section 5.1.2. Bright Cluster Manager treats HP iLO interfaces just like regular IPMI interfaces.

2. The ilo_power.pl custom power script must be configured on all nodes. This can be done with a cmsh script. For example, for all nodes in the slave category:

Example

[mycluster]% device foreach -c slave (set custompowerscript
top of each other, or the graphs can be laid out as a giant grid. The graph scale settings can also be adjusted, stored, and recalled for use the next time a session is started.

An alternative to cmgui's visualization tool is the command-line cmsh. This has the same functionality, in the sense that data values are selected and studied according to configurable parameters with it. The data values can even be plotted and displayed on graphs with cmsh, with the help of unix pipes and graphing utilities. However, the strengths of monitoring with cmsh lie elsewhere: cmsh is more useful for scripting, or for examining pre-decided metrics and health checks, rather than for a quick visual check over the system. This is because cmsh needs more familiarity with options, and is designed for text output instead of interactive graphs. Monitoring with cmsh is discussed in section 10.7.

How cmgui is used for visualization is now described.

10.3.1 The Monitoring Window
The Monitoring menu is selected from the menu bar of cmgui, and a cluster name is selected. The Monitoring window opens (Figure 10.6). The resources in the cluster are shown on the left side of the window. Clicking on a resource opens or closes its subtree of metrics and health checks.

The subsequent sections describe ways of viewing and changing resource settings. After having carried out such modifications, saving and loading a setti
tructure will take care of operating all ports of a device in the correct order when a power operation is done on the device.

It is also possible for multiple devices to share the same PDU port. This is the case, for example, when twin nodes are used, i.e. two nodes sharing a single power supply. In this case, all power operations on one device apply to all nodes sharing the same PDU port.

In cmgui, the Overview tab of a PDU (Figure 5.2) provides an overview of the state of PDU ports and of the devices that have been associated with each port.

5.1.2 IPMI-Based Power Control
IPMI-based power control relies on the baseboard management controller (BMC) inside a node. It is therefore only available for node devices. Blades inside a blade chassis typically use IPMI for power management. For details on setting up networking and authentication for IPMI interfaces, see section 4.3.

To carry out IPMI-based power control operations, the Power controlled by property in Figure 5.1 must be set to the IPMI interface through which power operations should be relayed. Normally this IPMI interface is ipmi0. Any list of configured APC PDU ports displayed in the GUI is ignored by default when the Power controlled by property is not set to apc.

Example

Configuring power parameter settings for a node using cmsh:

[mycluster]% device use node001
[mycluster->device[node001]]% set powerdistributionunits apc01:6 apc01:7 apc01:8
two versions are installed: the 32-bit version and the 64-bit (i.e. EM64T) version. Both versions can be invoked through the same set of commands, so the modules environment (see section 12.1) must be used to select one of the two versions. For the C compiler, the 32-bit and 64-bit modules are called intel/cc and intel/cce respectively. The modules for the Fortran compiler are called intel/fc and intel/fce. The Intel compilers also include a debugger, which can be used by loading the intel/idb or intel/idbe module. The following commands can be used to run the Intel compilers and debugger:

• icc: Intel C/C++ compiler
• ifort: Intel Fortran 90/95 compiler
• idb: Intel Debugger

Full documentation for the Intel compilers is available at http://software.intel.com/en-us/intel-compilers/.

12.3.3 PGI High-Performance Compilers
Package name: pgi

The PGI compiler package contains the PGI C/C++ and Fortran 77/90/95 compilers.

• pgcc: PGI C compiler
• pgCC: PGI C++ compiler
• pgf77: PGI Fortran 77 compiler
• pgf90: PGI Fortran 90 compiler
• pgf95: PGI Fortran 95 compiler
• pgdbg: PGI debugger

Full documentation for the PGI High-Performance Compilers is available at http://www.pgroup.com/resources/docs.htm.

12.3.4 AMD Open64 Compiler Suite
Package name: open64

The Open64 Compiler Suite contains optimizing C/C++ and Fortran compilers.

• opencc: Open64 C compiler
• openC
used. For example, 500 and onwards is used for regular UIDs in Red Hat, whereas 1000 and onwards is used in SuSE.

• A home directory is created and a login shell is set. Users with unset passwords cannot log in.

3. Edit allows users to be modified via a dialog.

4. Revert discards unsaved edits that have been made via the Edit button. The reversion goes back to the last save.

5. Remove removes selected rows of users, by default along with their home directories.

Group management in cmgui is started by selecting the Groups tab in the Users & Groups pane. Clickable LDAP object entries for regular groups then show up, similar to the user entries already covered above. Management of these entries is done with the same button functions as for user management.

7.2 Managing Users And Groups With cmsh
This section goes through a session to cover the cmsh functions that correspond to the user management functions of cmgui in the previous section. These functions are run from within cmsh's user mode:

Example

[root@mycluster ~]# cmsh
[mycluster]% user
[mycluster->user]%

7.2.1 Adding A User
This corresponds roughly to the functionality of the Add button operation in section 7.1. In user mode, the process of adding a user maureen to the LDAP directory is started with the add command:

Example

[mycluster->user]% add user maureen
[mycluster->user*[maureen*]]%

cmsh helpfully drops into the context of th
[Figure 10.22: cmgui Monitoring Health Check Configuration Display After Category Selection — health checks such as ldap, mounts and mysql are listed for the slave category, together with their check intervals.]

The Save button saves as-yet-uncommitted changes made via the Add or Edit buttons.

The Revert button discards unsaved edits made via the Edit button. The reversion goes back to the last save.

The Remove button removes a selected health check from the list of health checks.

The remaining buttons, Edit and Add, open up options dialogs. These are now discussed.

Health Check Configuration: The Main Tab's Edit And Add Options
The Health Check Configuration tab of Figure 10.22 has Add and Edit buttons. The Add button opens up a dialog to add a new health check to the list, and the Edit button opens up a dialog to edit a selected health check from the list. The dialogs are very similar to those of the Add and Edit options of Metric Configuration in section 10.4.2. The dialogs for the Health Check Configuration tab are as follows (Figure 10.23):

• Health Check: the name of the health check.
• Parameter: the values that the health check script is designed to handle. For example,
/var/spool/cmd

The SpoolDir directive specifies the directory which is used by CMDaemon to store temporary and semi-temporary files.

CMDaemonAudit
Syntax: CMDaemonAudit = yes|no
Default: CMDaemonAudit = no

When the CMDaemonAudit directive is set to yes, and a value is set for the CMDaemon auditor file with the CMDaemonAuditorFile directive, then CMDaemon actions are time-stamped and logged in the CMDaemon auditor file.

CMDaemonAuditorFile
Syntax: CMDaemonAuditorFile = filename
Default: CMDaemonAuditorFile = "/var/spool/cmd/audit.log"

The CMDaemonAuditorFile directive sets where the audit logs for CMDaemon actions are logged. The log format is:

(time stamp) profile [IP address] action (unique key)

Example

(Mon Jan 31 12:41:37 2011) Administrator [127.0.0.1] added Profile: arbitprof(4294967301)

DisableAuditorForProfiles
Syntax: DisableAuditorForProfiles = profile [, profile] ...
Default: DisableAuditorForProfiles = node

The DisableAuditorForProfiles directive sets the profiles for which an audit log of CMDaemon actions is disabled. A profile (section 3.3.3) defines the services that CMDaemon provides for that profile user. More than one profile can be set, as a comma-separated list. Out of the profiles that are available on a newly installed system (node, admin, cmhealth and readonly), only the profile node is enabled by defau
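Putting the auditing directives together, enabling audit logging in /etc/cmd.conf might then look like the following sketch (the quoting style follows the defaults shown above; whether CMDaemon must be restarted for the change to take effect should be verified for the version in use):

```
# In /etc/cmd.conf: turn on time-stamped audit logging of CMDaemon actions
CMDaemonAudit = yes
CMDaemonAuditorFile = "/var/spool/cmd/audit.log"
```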
when no IPMI network has been defined, but nodes do have IPMI. In this case the BOOTIF and ipmi0 interfaces have IP addresses assigned on the same network, but if a different offset is entered for the ipmi0 interface, then the assigned IP addresses start from the offset specified.

A different network can be selected for each interface using the drop-down box in the Network column. Selecting Unassigned disables a network interface.

If the corresponding network settings are changed (e.g. the base address of the network), the IP address of the head node interface needs to be modified accordingly. If IP address settings are invalid, an alert is displayed explaining the error.

Clicking Continue on a Network Interfaces screen validates IP address settings for all node interfaces. If all settings are correct, and if InfiniBand networks have been defined, this leads to the Subnet Managers screen (Figure 2.15), described in the next section. If no InfiniBand networks are defined, or if InfiniBand networks have not been enabled on the networks settings screen, then clicking Continue on this screen leads to the CD/DVD-ROMs selection screen (Figure 2.16).

[Installer screen "Network Interfaces": head node interfaces eth0 (internalnet, 10.141.255.254), eth1 (externalnet, DHCP), eth2 (Unassigned) and ipmi0 (ipminet), plus the node network interface BOOTIF (internalnet).]
<xs:simpleType>
  <xs:restriction base="xs:string">
    <xs:enumeration value="ext2"/>
    <xs:enumeration value="ext3"/>
    <xs:enumeration value="xfs"/>
  </xs:restriction>
</xs:simpleType>
</xs:element>
<xs:element name="mountPoint" type="xs:string"/>
<xs:element name="mountOptions" type="xs:string" default="defaults"/>
</xs:sequence>
</xs:group>

<xs:complexType name="raid">
  <xs:sequence>
    <xs:element name="member" type="xs:string" minOccurs="2" maxOccurs="unbounded"/>
    <xs:element name="level" type="xs:int"/>
    <xs:choice minOccurs="0" maxOccurs="1">
      <xs:group ref="filesystem"/>
      <xs:element name="swap"><xs:complexType/></xs:element>
    </xs:choice>
  </xs:sequence>
  <xs:attribute name="id" type="xs:string" use="required"/>
</xs:complexType>

<xs:complexType name="volumeGroup">
  <xs:sequence>
    <xs:element name="name" type="xs:string"/>
    <xs:element name="extentSize" type="extentSize"/>
    <xs:element name="physicalVolumes">
      <xs:complexType>
        <xs:sequence>
          <xs:element name="member" type="xs:string" minOccurs="1" maxOccurs="unbounded"/>
        </xs:sequence>
      </xs:complexType>
    </xs:element>
    <xs:element name="logical
[Figure 4.2: Networks — the resource tree lists the networks externalnet and internalnet; internalnet has base address 10.141.0.0 and a dynamic range from 10.141.128.0 to 10.141.143.255.]

In the context of the OSI Reference Model, each network object represents a layer 3 (i.e. Network Layer) IP network, and several layer 3 networks can be layered on a single layer 2 network (e.g. an Ethernet segment).

Selecting a network in the resource tree displays its tabbed pane. By default, the tab displayed is the Overview tab. This gives a convenient overview of all IP addresses assigned in the selected network (Figure 4.3).

[Figure 4.3: Network Overview — e.g. nodes on the network (myheadnode eth0 10.141.255.254, node001 BOOTIF 10.141.0.1, node002 BOOTIF 10.141.0.2), switches on the network (switch01 10.141.253.1), and power distribution units on the network.]

Selecting the Settings tab (Figure 4.4) allows a number of network properties (Figure 4.5) to be changed.
13:55:32 10.141.0.1 node-installer: Failed to create disk layout. Exit code 4, signal 0.
Mar 24 13:55:32 10.141.0.1 node-installer: There was a fatal problem. This node can not be installed until the problem is corrected.

Figure 6.25: No Disk

It is likely that this issue is caused by the correct storage driver not being loaded. To solve this issue, the correct kernel module should be added to the software image's kernel module configuration.

Experienced system administrators work out what drivers may be missing by checking the results of hardware probes. For example, the output of lspci provides a list of hardware detected in the PCI slots, giving the chipset name of the storage controller hardware in this case:

Example

[root@bright51 ~]# lspci | grep SCSI
00:10.0 Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon] (rev 03)

The next step is to Google with likely search strings based on that output. The Linux Kernel Driver DataBase (LKDDb) is a hardware database built from kernel sources that lists driver availability for Linux. It is available at http://cateee.net/lkddb/. Using the Google search engine's "site" operator to restrict results to the cateee.net web site only, a likely string to try might be:

Example

SAS2008 site:cateee.net

The search result indicates that the mpt2sas kernel module needs to be added to the node kernels. A look in the mod
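A sketch of how such a module might then be added to the kernel module configuration of the software image with cmsh follows (the kernelmodules submode shown here is an assumption to verify against the softwareimage mode on the cluster in use):

```shell
# Sketch: add the mpt2sas driver to the default image's kernel module
# list, then commit the change (regeneration of the image's ramdisk
# after the commit is assumed behavior).
cmsh -c "softwareimage; use default-image; kernelmodules; add mpt2sas; commit"
```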
% set hostname foobar
[foobar->device*[foobar*]]% commit
[foobar->device[foobar]]% quit
[root@mycluster ~]# sleep 30; hostname -f
foobar.cm.cluster
[root@mycluster ~]#

Note that the shell prompt still shows the hostname as mycluster, because its prompt string is only set when a new shell is started.

[Figure 4.7: Head Node Settings — the Settings tab of the head node in cmgui, with fields such as Hostname, Hardware tag, MAC address, Rack position, Ethernet switch port, Power controlled by, Custom power script and Power Distribution Unit ports.]

Changing External Network Parameters
When a cluster interacts with an external networ
Figure 6.22: Node Creation Wizard, Placeholders Created (cmgui screenshot: node002, with a MAC address ending in 01:0F:F8, followed by node003 through node011, each with the unfilled MAC address 00:00:00:00:00:00, category slave, and IP addresses 10.141.0.3 through 10.141.0.11)

The MAC addresses can be assigned to a node via the node identification wizard. However, leaving nodes in a placeholder state, where the MAC address entry is left unfilled, means that any new node with an unassigned MAC address that is started up is offered a choice out of the created node names by the provisioning system at its console. This happens when the node-installer reaches the node configuration stage during node boot, as described in section 6.3.2. This is sometimes preferable to associating the node name with a MAC address remotely.

The node creation wizard can set IP addresses for the nodes. At one point in the dialog, a value for IP offset can also be set (figure 6.23).

Figure 6.23 (cmgui screenshot: interfaces eth0 through eth3, each with a network selection and an IP offset field defaulting to 0.0.0.0)
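The effect of the IP offset field can be illustrated with plain shell arithmetic. This is a sketch of the idea only, not wizard code: the base network, node index, and the way the offset is applied are assumptions for illustration.

```shell
# Sketch: how an IP offset shifts the address a node would otherwise get.
# Base network 10.141.0.0, node index 3, offset 4 -- illustrative values only.
base="10.141.0.0"; index=3; offset=4

# Split the dotted quad:
oldIFS=$IFS; IFS=.; set -- $base; IFS=$oldIFS
a=$1; b=$2; c=$3; d=$4

# Convert to a 32-bit number, add index and offset, convert back:
num=$(( (a << 24) + (b << 16) + (c << 8) + d + index + offset ))
addr=$(printf '%d.%d.%d.%d' $(( (num >> 24) & 255 )) $(( (num >> 16) & 255 )) \
       $(( (num >> 8) & 255 )) $(( num & 255 )))
echo "$addr"   # 10.141.0.7
```

With an offset of 0.0.0.0 the node would get 10.141.0.3 here; the offset of 4 shifts it to 10.141.0.7.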
... 2038
Version:        5.1
Edition:        Advanced
Licensed Nodes: 3
Node Count:     2
MAC Address:    ...

The license in the example above allows just 3 nodes to be used. It is not tied to a specific MAC address, so it can be used anywhere. For convenience, the Node Count field in the output of licenseinfo shows the current number of nodes used.

4.1.2 Verifying A License: The verify-license Utility
The verify-license utility is used to check licenses independent of whether the cluster management daemon is running.

When an invalid license is used, the cluster management daemon cannot start. The license problem is logged in the cluster management daemon logfile:

Example

[root@myheadnode ~]# /etc/init.d/cmd start
Waiting for CMDaemon to start...
CMDaemon failed to start please see log file.
[root@myheadnode ~]# tail -1 /var/log/cmdaemon
Dec 30 15:57:02 myheadnode CMDaemon: Fatal: License has expired

but further information cannot be obtained with, for example, cmgui and cmsh, because these clients themselves obtain their information from the cluster management daemon. In such a case, the verify-license utility is meant for troubleshooting license issues, using the following options.

The info option of verify-license prints license details:

Example

[root@myheadnode ~]# verify-license
Usage: verify-license <path to certificate> <p
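When CMDaemon refuses to start, the relevant message can be filtered out of its logfile. The following is a small illustrative sketch, not a Bright utility: the sample line approximates the punctuation of the log entry quoted in the text.

```shell
# Sketch: filter the fatal license message out of a CMDaemon log line.
# The sample line approximates the logfile entry quoted above.
logline="Dec 30 15:57:02 myheadnode CMDaemon: Fatal: License has expired"
msg=$(echo "$logline" | grep -o 'Fatal.*')
echo "$msg"
```

In practice one would run a similar grep over /var/log/cmdaemon itself.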
123. ADDRESS_BITS CL_DEVICE_MAX_MEM_ALLOC_SIZE CL_DEVICE_GLOBAL_MEM_SIZE CL_DEVICE_ERROR_CORRECTION_SUPPORT CL_DEVICE_LOCAL_MEM_TYPE CL_DEVICE_LOCAL_MEM_SIZE CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE CL_DEVICE_QUEUE_PROPERTIES CL_DEVICE_QUEUE_PROPERTIES CL_DEVICE_IMAGE_SUPPORT CL_DEVICE_MAX_READ_IMAGE_ARGS CL_DEVICE_MAX_WRITE_IMAGE_ARGS CL_DEVICE_SINGLE_FP_CONFIG CL_DEVICE_IMAGE lt dim gt CL_DEVICE_EXTENSIONS CL_DEVICE_COMPUTE_CAPABILITY_NV NUMBER OF MULTIPROCESSORS NUMBER OF CUDA CORES CL_DEVICE_REGISTERS_PER_BLOCK_NV CL_DEVICE_WARP_SIZE_NV CL_DEVICE_GPU_OVERLAP_NV CL_DEVICE_KERNEL_EXEC_TIMEOUT_NV CL_DEVICE_INTEGRATED_MEMORY_NV CL_DEVICE_PREFERRED_VECTOR_WIDTH_ lt t gt 3 512 512 64 512 1296 MHz 32 1023 MByte 4095 MByte no local 16 KByte 64 KByte CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE CL_QUEUE_PROFILING_ENABLE 1 128 8 INF quietNaNs round to nearest 2D_MAX_WIDTH 4096 2D_MAX_HEIGHT 32768 3D_MAX_WIDTH 2048 3D_MAX_HEIGHT 2048 3D_MAX_DEPTH 2048 cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 1 3 30 240 16384 32 CL_TRUE CL_FALSE CL_FALSE CHAR 1 SHORT 1 INT 1 LONG 1 FLOAT oclDeviceQuery Platform Name NVIDIA CUDA Platform V
...RAID 10 configurations. Note that when RAID is used, the administrator is responsible for ensuring that the correct kernel modules are loaded. Normally, including one of the following modules should be sufficient: raid0, raid1, raid4, raid5, raid6.

<?xml version="1.0" encoding="ISO-8859-1"?>
<diskSetup xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
           xsi:noNamespaceSchemaLocation="schema.xsd">
  <device>
    <blockdev>/dev/sda</blockdev>
    <partition id="a1">
      <size>25G</size>
      <type>linux raid</type>
    </partition>
  </device>
  <device>
    <blockdev>/dev/sdb</blockdev>
    <partition id="b1">
      <size>25G</size>
      <type>linux raid</type>
    </partition>
  </device>
  <raid id="r1">
    <member>a1</member>
    <member>b1</member>
    <level>1</level>
    <filesystem>ext3</filesystem>
    <mountPoint>/</mountPoint>
    <mountOptions>defaults,noatime,nodiratime</mountOptions>
  </raid>
</diskSetup>

D.6 Example: Logical Volume Manager
This example shows a simple LVM setup. The member tags must refer to an id attribute of a partition tag, or an id attribute of a raid tag. Note that when LVM is used, the administrator is responsible for ensuring that the dm-mod kernel module
Bright Cluster Manager are described briefly. For all packages, references to the complete documentation are provided.

12.1 Modules Environment
The modules environment (http://modules.sourceforge.net/) allows a user of a cluster to modify the shell environment for a particular application, or even a particular version of an application. Typically, a module file defines additions to environment variables such as PATH, LD_LIBRARY_PATH, and MANPATH. Cluster users use the module command to load or remove modules from their environment. Details on the modules environment from a user's perspective can be found in the Bright Cluster Manager User Manual.

All module files are located in the /cm/local/modulefiles and /cm/shared/modulefiles trees. A module file is a TCL script in which special commands are used to define functionality. The modulefile(1) man page has more detail on this.

Modules can be combined in meta-modules. By default, the default-environment meta-module exists, which allows a user to load a number of other modules at once. Cluster administrators are encouraged to customize the default-environment meta-module to set up a recommended environment for their users. The default-environment meta-module is empty by default.

12.2 Shorewall
Bright Cluster Manager uses the Shoreline Firewall, more commonly known as the Shorewall package, to provide firewall and gateway functionality on the head node of a cluster. Shorewall is a
By placeholders here, it is meant that an incomplete node object is set. For example, sometimes it is useful to create a node object with the MAC address setting unfilled because it is still unknown. Why this can be useful is covered shortly.

6.6.2 Adding New Nodes With The Node Creation Wizard
Besides adding nodes using the add command of cmsh or the Add button of cmgui, as in the previous section, there is also a cmgui wizard that guides the administrator through the process: the node creation wizard. This is useful when adding many nodes at a time. It is available from the Slave Nodes resource, by selecting the Overview tabbed pane and then the Create Nodes button (figure 6.21).

This wizard should not be confused with the closely related node identification wizard described earlier in section 6.3.2, which identifies unassigned MAC addresses and switch ports, and helps assign them node names. The node creation wizard instead creates an object for nodes and assigns them node names, but it leaves the MAC address field for these nodes unfilled, keeping each node object as a placeholder (figure 6.22).

Figure 6.22 (cmgui screenshot: the Slave Nodes overview, listing node001 with MAC address 00:0C:29:80:6E:7E, category slave, IP address 10.141.0.1, software image default-image, followed by further nodes)
...Open64 C compiler
• openf90: Open64 Fortran 90 compiler
• openf95: Open64 Fortran 95 compiler

Full documentation for the AMD Open64 Compiler Suite is available at http://www.amd.com/.

12.3.5 FLEXlm License Daemon
Package name: flexlm

For the Intel and PGI compilers, a FLEXlm license must be present in the /cm/shared/licenses tree.

For workstation licenses, i.e. a license which is only valid on the head node, the presence of the license file is typically sufficient. However, for floating licenses, i.e. a license which may be used on several machines, possibly simultaneously, the FLEXlm license manager, lmgrd, must be running.

The lmgrd service serves licenses to any system that is able to connect to it through the network. With the default firewall configuration, this means that licenses may be checked out from any machine on the internal cluster network. Licenses may be installed by adding them to /cm/shared/licenses/lmgrd/license.dat. Normally, any FLEXlm license starts with the following line:

SERVER hostname MAC port

Only the first FLEXlm license that is listed in the license.dat file used by lmgrd may contain a SERVER line. All subsequent licenses listed in license.dat should have the SERVER line removed. This means in practice that all except for the first of the licenses listed in license.dat start with a line:

DAEMON name /full/path/to/vendor-daemon

The DAEMO
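Putting the SERVER and DAEMON rules together, a combined license.dat might be laid out roughly as follows. This is an illustrative sketch only: the hostname, MAC address, port, daemon paths, and FEATURE lines are hypothetical placeholders, not values from the manual.

```
# /cm/shared/licenses/lmgrd/license.dat (illustrative layout)
# First license: the only one allowed to carry a SERVER line.
SERVER headnode 001122334455 27000
DAEMON pgroupd /cm/shared/apps/pgi/license/pgroupd
FEATURE ...

# Subsequent licenses: SERVER line removed, starting with DAEMON.
DAEMON INTEL /cm/shared/apps/intel/licenses/INTEL
FEATURE ...
```

The single-SERVER rule means lmgrd binds to one host and port, while each vendor daemon is still resolved per license.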
128. Ethernet Swi IB Swit Rack Generic Dev Chassis metric Gpu unit metric Figure 10 25 cmgui Monitoring Main Metrics Tab Edit Dialog Description the description of the metric Command the command that carries out the script or the full path to the executable script Command timeout After how many seconds the script should stop running in case of no response Parameter an optional value that is passed to the script Cumulative whether the value is cumulative for example the bytes received counter for an ethernet interface or non cumulative for example temperature Unit the unit in which the metric is measured When to run Disabled if ticked the metric script does not run Only when idle if ticked the metric script only runs when the system is idling This burdens the system less if the metric is resource hungry Sampling Method the options are Sampling on master The head node samples the metric on behalf of a device For example the head node may do this for a PDU since a PDU does not have the capability to run the cluster management daemon at present and so cannot itself pass on data values directly when cmsh or cmgui need them Sampling on slave The non head node samples the metric itself e Class An option selected from Misc Bright Computing Inc 188 Cluster Monitoring CPU GPU Disk Memory Network Environmenta
129. FTFO8KQXI8J1PXMOh6vv0PtP5rw5D5V2cyVe2i4ez9Y8XMVEcbf601ptKyY bRU jQq 9SN jt 12ESU67YyLstSN68ach9AfO03P0SZIKkiNwfAO VBILv2Mhn7xd74 5LOM eJ71HSpeJA2Rzs6szc2340b VxG GW j ogaK3NE1SY0zQot kOVMdMWsQm 8 Ras19IA9P5 j1SbcZQ1H1P jndS4x4XQ8P41ATczsIDyWhsJC51rTuw9 Q07fqvvPn xsRz1pFmiiN71I4JLjwOnAlXexn4EaeVa7EbtulT j vxJZNdShs7Td740m1F7RKFccl wLuISQQYEQIACQUCSq1h6wIbDAAKCRDvaS9mtk3m0C oAJsHMmKrLPhjCdZyHbB1i e19 5 JABUWCfUOP oawBNOHzDnfr3MLaTgCwjsEE WJX7 Bright Computing Inc CMDaemon Configuration File Directives This Appendix lists all configuration file directives that may be used in the cluster management daemon configuration file cm local apps cmd etc cmd conf To activate changes in a configuration file the cmd service must be restarted and is normally done with the command service cmd restart Master directive Syntax Master hostname Default Master master The cluster management daemon treats the host specified in the Master directive as the head node A cluster management daemon running on a node specified as the head node will start in head mode On a regular node it will start in node mode Port directive Syntax Port number Default Port 8080 The number used in the syntax above is a number between 0 and 65535 The standard port is 8080 The Port directive controls the non SSL port that the cluster manage ment daemon listens on In practice all communication with the cluster management daemon is carrie
N line must refer to the vendor daemon for a specific application. For PGI, the vendor daemon (called pgroupd) is included in the pgi package. For Intel, the vendor daemon (called INTEL) must be installed from the flexlm-intel package.

Installing the flexlm package adds a system account lmgrd to the password file. The account is not assigned a password, so it cannot be used for logins. The account is used to run the lmgrd process.

The lmgrd service is not configured to start up automatically after a system boot, but can be configured to do so with:

chkconfig lmgrd on

The lmgrd service is started manually with:

/etc/init.d/lmgrd start

The lmgrd service logs its transactions and any errors to /var/log/lmgrd.log.

More details on FLEXlm and the lmgrd service are available at http://www.rovicorp.com/.

12.4 Intel Cluster Checker
Package name: intel-cluster-checker

The Intel Cluster Checker is a tool that verifies if a cluster complies with all of the requirements of the Intel Cluster Ready Specification. This section lists the steps that must be taken to certify a cluster as Intel Cluster Ready.

12.4.1 Preparing Cluster
The Intel Cluster Ready specification requires a number of packages to be installed on the head and regular nodes. The cm-config-intelcompliance-master and cm-config-intelcompliance-slave packages are installed on the head node and software images, respectively. The i
OpenFabrics Enterprise Distribution (OFED) packages that are part of the Linux base distribution are used. By default, all relevant OFED packages are installed on the head node and software images. It is possible to replace the native OFED with a custom OFED on the entire cluster, or on certain software images. Administrators may choose to switch to a different OFED version if the HCAs used are not supported by the native OFED version, or to increase performance by using an OFED version that has been optimized for a particular HCA.

If the InfiniBand network was enabled during installation, the openibd service was scheduled to be started at boot-up for all nodes. The openibd service takes care of loading the relevant InfiniBand HCA kernel modules. When adding an InfiniBand network after installation, it may be necessary to use chkconfig manually to configure the openibd service to be started at boot time on the head node and inside the software images.

4.4.2 Subnet Managers
Every InfiniBand subnet requires at least one Subnet Manager to be running. The Subnet Manager takes care of routing, addressing, and initialization on the InfiniBand fabric. Some InfiniBand switches include subnet managers. However, on large InfiniBand networks, or in the absence of a switch-hosted Subnet Manager, a Subnet Manager needs to be started on at least one node inside the cluster. When multiple Subnet Managers are started on the same InfiniBand subnet, one instance wil
Parameter        Value
---------------  ----------------
Common name      maureen
Group ID         502
Home directory   /home/maureen
Login shell      /bin/bash
Password         *********
User ID          502
User name        maureen

If, however, commit were to be run at the user mode level, without dropping down to the username level, then instead of just that modified user, all modified users and groups would be committed.

When the commit is done, all the empty fields for the user are automatically filled in with defaults based on the underlying Linux distribution used. Also, as a security precaution, if an empty field (that is, a not-set password entry) is committed, then a login into the account is not allowed. So, while the account exists at this stage, it still cannot be logged into until the password is set. Logging in requires first editing a property of user maureen, namely the empty password field. Editing passwords and other properties is covered next.

7.2.3 Editing Properties Of Users And Groups
This corresponds roughly to the functionality of the Edit button operation in section 7.1.

In section 7.2.2 above, a user account maureen was made, which had as one of its properties an unset password. Account logins with an unset password are refused, and so the password needs to be set if the account is to function.

Editing Users With set And clear
The tool used to set user and group properties is the set command. Typing set and then e
133. Temperature Total number of milliseconds spent by all reads Total number of reads completed successfully Number of running jobs Number of processes in runnable state Temperature of a Hard Disk Assembly Number of remapped sectors Frequency of errors appearance while posi tioning the head Average efficiency of operations whilst posi tioning the head Frequency of program errors while reading data Total number of sectors read successfully Total number of sectors written successfully System or CPU fan speed sensor Temperature sensor system and CPU Motherboard voltage sensor Free swap space Used swap space Total number of good packets received and directed to the broadcast address Switch CPU utilization estimation Total number of collisions on this network segment Bright Computing Inc continues 292 Metrics Health Checks And Actions Table H 1 1 List Of Metrics continued Name Description SwitchDelayDiscardFrames SwitchFilterDiscardFrames SwitchMTUDiscardFrames SwitchMulticastPackets SwitchOverSizedPackets SwitchUnderSizedPackets SwitchUptime TotalCPUIdle TotalCPUSystem TotalCPUUser TotalMemoryUsed TotalNodes TotalSwapUsed Uptime UsedSpace WriteTime Writes await_sda gpu ilo ipForwDatagrams ipFragCreates ipFragFails ipFragOKs ipInAddrErrors Number of frames discarded due to excessive transit delay through the bridge
...ToSGEConfig is set to true, CMDaemon will not make any modifications to the SGE configuration.

FreezeChangesToPBSConfig directive
Syntax: FreezeChangesToPBSConfig = true|false
Default: FreezeChangesToPBSConfig = false
When FreezeChangesToPBSConfig is set to true, CMDaemon will not make any modifications to the PBS configuration.

FreezeChangesToTorqueConfig directive
Syntax: FreezeChangesToTorqueConfig = true|false
Default: FreezeChangesToTorqueConfig = false
When FreezeChangesToTorqueConfig is set to true, CMDaemon will not make any modifications to the Torque configuration.

ProvisioningNodeAutoUpdate directive
Syntax: ProvisioningNodeAutoUpdate = true|false
Default: ProvisioningNodeAutoUpdate = true
If ProvisioningNodeAutoUpdate is set to true, provisioning nodes are:
1. automatically updated every 24 hours;
2. automatically updated when a provisioning request is made, if the ProvisioningNodeAutoUpdateTimer directive allows it.
These updates are disabled if ProvisioningNodeAutoUpdate is set to false.

ProvisioningNodeAutoUpdateTimer directive
Syntax: ProvisioningNodeAutoUpdateTimer = number
Default: ProvisioningNodeAutoUpdateTimer = 300
When the head node receives a provisioning request, it checks if the last update of the provisioning nodes is more than number seconds ago. If this is the case, an update is triggered. The update is disabled if ProvisioningNodeAutoUpdate is set to false.
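For reference, these directives would sit in /cm/local/apps/cmd/etc/cmd.conf along the following lines. The values shown are the documented defaults; the exact key = value formatting is assumed to match the other directives in this appendix.

```
# /cm/local/apps/cmd/etc/cmd.conf (fragment, default values)
FreezeChangesToSGEConfig = false
FreezeChangesToPBSConfig = false
FreezeChangesToTorqueConfig = false
ProvisioningNodeAutoUpdate = true
ProvisioningNodeAutoUpdateTimer = 300
```

As noted at the start of this appendix, the cmd service must be restarted (service cmd restart) for changes to take effect.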
...Volumes">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="volume" type="logicalVolume" minOccurs="1" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>
</xs:sequence>
</xs:complexType>

<xs:complexType name="logicalVolume">
  <xs:sequence>
    <xs:element name="name" type="xs:string"/>
    <xs:element name="size" type="size"/>
    <xs:group ref="filesystem" minOccurs="0" maxOccurs="1"/>
  </xs:sequence>
</xs:complexType>
</xs:schema>

D.2 Example: Default Node Partitioning
The following example shows the default layout used for regular nodes. This example assumes a single disk. Because multiple blockdev tags are used, the node-installer will first try to use /dev/sda, and then /dev/hda. For each partition, a size is specified. Sizes can be specified using megabytes (e.g. 500M), gigabytes (e.g. 50G), or terabytes (e.g. 2T). Alternatively, a size of max will use all remaining space. For swap partitions, a size of auto will result in twice the node's memory size. In this case, all file systems are specified as ext3; valid alternatives are ext2 and xfs. For details on mount options, please refer to the mount man page. Note that if the mountOptions tag is left empty, its value will default to defaults.

<?xml version="1.0"
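The size suffixes accepted by the partitioning setup can be converted mechanically. The helper below is an illustrative sketch, not node-installer code, showing how the M, G, and T suffixes relate to one another.

```shell
# Illustrative helper, not node-installer code: convert a <size> value
# with an M, G, or T suffix into megabytes.
to_megabytes() {
  n=${1%[MGT]}          # numeric part, suffix stripped
  case $1 in
    *M) echo "$n" ;;
    *G) echo $(( n * 1024 )) ;;
    *T) echo $(( n * 1024 * 1024 )) ;;
    *)  echo "unsupported size: $1" >&2; return 1 ;;
  esac
}
to_megabytes 500M    # 500
to_megabytes 50G     # 51200
to_megabytes 2T      # 2097152
```

The special values max and auto are handled by the node-installer itself and are not sizes in this arithmetic sense.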
...
echo POWERCONTROL=$POWERCONTROL
echo PARTITION=$PARTITION
echo GATEWAY=$GATEWAY
echo PDUS=$PDUS
echo ETHERNETSWITCH=$ETHERNETSWITCH
for interface in $INTERFACES; do
  eval type=\$INTERFACE_${interface}_TYPE
  eval ip=\$INTERFACE_${interface}_IP
  eval mask=\$INTERFACE_${interface}_NETMASK
  echo $interface type $type
  echo $interface ip $ip
  echo $interface netmask $mask
done

The initialize script runs after the install-mode type and execution have been determined, but before unloading specific drivers and before partitions are checked and filesystems mounted. Data output can be written by it to writable parts of the NFS drive.

For a finalize script, which runs just before switching from using the ramdrive to using the local hard drive, the local hard drive is mounted under /localdisk. Data can therefore be written to there if needed, for example predetermined configuration files from the NFS drive for a particular node.

Example

#!/bin/bash
ln -sf /etc/myapp.conf.$HOSTNAME /localdisk/etc/myapp.conf

One way of writing the environment out to the local disk is to redirect and append the variables to a file on the drive, for example as illustrated by the following:

Example

#!/bin/bash
echo "HOSTNAME=$HOSTNAME" > /localdisk/env
echo "HWTAG=$HWTAG" >> /localdisk/env
echo "MAC=$MAC" >> /localdisk/env

Data stored earlier by an initialize script can be copied over
takes care of any changes needed in the slapd.conf file when a head node changes state from passive to active or vice versa, and also ensures that the active head node propagates its LDAP database changes to the passive node via a syncprov/syncrepl configuration in slapd.conf.

External LDAP Server With No Replication Locally Case
In the case of an external LDAP server being used, but with no local replication involved, no special high availability configuration is required. The LDAP client configuration in /etc/ldap.conf simply remains the same for both active and passive head nodes, pointing to the external LDAP server. The file /cm/images/default-image/etc/ldap.conf in each image directory also points to the same external LDAP server.

External LDAP Server With Replication Locally Case
In the case of an external LDAP server being used, with the external LDAP provider being replicated to the high availability cluster, it is generally more efficient for the passive node to have its LDAP database propagated and updated only from the active node to the passive node, and not updated from the external LDAP server. The configuration should therefore be:
• an active head node that updates its consumer LDAP database from the external provider LDAP server
• a passive head node that updates its LDAP database from the active head node's LDAP database

Although the final configuration is the same, the sequence in which LDAP replication configura
...al network. Ensure that the second NIC (i.e. eth1) is physically connected to the external network.

Verify that the license parameters are correct:

cmsh -c "main; licenseinfo"

If the license being used is a temporary license (see the End Time value), a new license should be requested well before the temporary license expires. The procedure for requesting and installing a new license is described in section 4.1.

Booting Nodes
1. Make sure the first NIC (i.e. eth0) on the head node is physically connected to the internal cluster network.
2. Configure the BIOS of the nodes to boot from the network, and boot the nodes.
3. If everything goes well, the node-installer component will be started and a certificate request will be sent to the head node. If the node does not make it to the node-installer, it is possible that additional kernel modules are needed. Section 6.7 contains more information on how to diagnose problems during the node booting process.
4. To manually identify each node, select "Manually select node" on each node, and identify the node manually by selecting a node entry from the list and choosing Accept.
5. (Optional) To allow nodes to be identified based on Ethernet switch ports, consult section 4.5.
6. (Optional) For larger clusters, assigning identities to nodes can be tedious to do manually. The node identification wizard (section 6.3.2), running from cmgui, automates the process, so that nodes do not require manual ide
For smaller clusters, a configuration where LDAP clients on all nodes point directly to the external server is recommended. An easy way to set this up is as follows:

• On the head node:
  - the URIs in /etc/ldap.conf, and in the image file /cm/images/default-image/etc/ldap.conf, are set to point to the external LDAP server;
  - the updateprovisioners command (section 6.1.4) is run to update any other provisioners.

• Then, to update configurations on the regular nodes:
  - they can simply be rebooted to pick up the updated configuration;
  - alternatively, to avoid a reboot, the imageupdate command (section 6.5.2) can be run to pick up the new image from a provisioner.

• In the CMDaemon configuration file cmd.conf (Appendix C):
  - if another LDAP tool is to be used to manage external LDAP user management instead of cmgui or cmsh, then altering cmd.conf is not required. If, however, system users and groups are to be managed via cmgui or cmsh, then CMDaemon, too, must refer to the external LDAP server instead of the default LDAP server on the head node. To set that up:
    - the LDAPHost, LDAPUser, LDAPPass, and LDAPSearchDN directives in cmd.conf are changed to refer to the external LDAP server;
    - CMDaemon is restarted to enable the new configurations.

For larger clusters, the preceding solution can cause issues due to traffic latency, security, and connectivity fault tolerance. If such occur, a better solution is to replicate the
already been set up. The resulting combination setup then retains user information, such as the login shell, home directory, and UID, in the LDAP database, while the password and validity period information are managed by the Kerberos database.

7.4.1 Matching Realms
Both LDAP and Kerberos manage different realms, such as example.com or cm.cluster. For LDAP to authenticate against Kerberos, there must be a matching realm between them. Changing the LDAP realms to match the Kerberos realm is done as follows:

1. The Kerberos realm can be accessed in /etc/krb5.conf on the Kerberos server. Its value is noted.

2. In /cm/local/apps/openldap/etc/slapd.conf, these lines should be updated to match the Kerberos realm, by replacing dc=cm,dc=cluster:

suffix "dc=cm,dc=cluster"
rootdn "cn=root,dc=cm,dc=cluster"

The LDAP server is then restarted with the command:

service ldap restart

3. The ldap.conf file on all nodes should also be modified to match the new realm, by modifying the dc attributes in the following line:

base dc=cm,dc=cluster

This modification can be implemented by changing:
(a) /etc/ldap.conf on the head node;
(b) /cm/images/<image>/etc/ldap.conf on the head node, where <image> indicates the image used for the non-head nodes. Running the imageupdate command (section 6.5.2) then implements the changes to the non-head nodes.

7.4.2 Configuring The LDAP Server As A K
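The mapping from a Kerberos realm to the dc= components of an LDAP suffix is mechanical. The snippet below is an illustrative sketch using a hypothetical realm; it is not a Bright-provided tool, just a way of seeing how the two notations correspond.

```shell
# Sketch: derive an LDAP base DN from a Kerberos realm (hypothetical realm).
realm="EXAMPLE.COM"
basedn=$(echo "$realm" | tr 'A-Z' 'a-z' | sed -e 's/^/dc=/' -e 's/\./,dc=/g')
echo "$basedn"    # dc=example,dc=com
```

For the default cm.cluster realm, the same transformation yields dc=cm,dc=cluster, which matches the suffix and base lines shown above.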
...values.

Example

[bright51]% device showport 00:0C:29:01:0F:F8
switch01:8
[bright51]% device set node003 ethernetswitch switch01:8
[bright51]% device commit

node identification wizard
The node identification wizard tabbed pane, under the Slave Nodes resource (figure 6.12), is roughly the cmgui equivalent to the newnodes command of cmsh. Like newnodes, the wizard lists the MAC address and switch port of any unassigned node that the head node detects. Additionally, it can help assign a node name to the node, assuming the node name exists, for example after running the node creation wizard of section 6.6.2. After assignment is done, the new status is saved with the Save button of the Overview tabbed pane.

Figure 6.12: Node Identification Wizard (cmgui screenshot: for each detected MAC address, the wizard shows when it first appeared, the switch port, the node name to assign, and an action, together with Refresh and Assign buttons)

The most useful way of using the wizard is for node assignment in large clusters. To do this, it is assumed that the node objects have already been cre
named user. Other useful commands are:

tracejob <id>         show what happened today to job <id>
tracejob -n <d> <id>  search the last <d> days
qmgr                  sets individual server type settings
qterm                 terminates queues (but CM starts pbs_server again)
pbsnodes -a           list available worker nodes in the queue

The commands of PBS Pro are documented in the PBS Professional 10.4 Reference Guide. There is further extensive documentation for PBS Pro administrators in the PBS Professional 10.4 Administrator's Guide. Both are available at the PBS Works website at http://www.pbsworks.com/SupportDocuments.aspx.

Metrics, Health Checks, And Actions
This appendix describes the metrics, health checks, and actions in a newly installed cluster.

H.1 Metrics And Their Parameters
H.1.1 Metrics

Table H.1.1: List Of Metrics

Name               Description
AlertLevel         Indicates the healthiness of a device: the lower, the better
AvgExpFactor       Average Expansion Factor. This is by what factor, on average, jobs took longer to run than expected. The expectation is according to heuristics based on duration in past and current job queues, as well as node availability
AvgJobDuration     Average job duration of current jobs
BufferMemory       System memory used for buffering
BytesRecv          Number of bytes received
BytesSent          Number of bytes sent
CMDActiveSessions  Managed active sessions count
CMDCycleTime       Time used by master to process picked-up data
changes to software images and performing package management can be found in chapter 9.

3.1.3 Node Categories
A node category is a group of ordinary nodes that share the same configuration. Node categories exist to allow an administrator to configure a large group of nodes at once. In addition, it is frequently convenient to perform certain operations (e.g. a reboot) on a number of nodes at a time.

A node is in exactly one category at all times, and is by default in the slave category.

Nodes are typically divided into node categories based on the hardware specifications of a node, or based on the task that a node is to perform. Whether or not a number of nodes should be placed in a separate category depends mainly on whether the configuration (e.g. monitoring setup) for these nodes will differ from the rest of the nodes.

One of the parameters of a node category is the software image that is to be used for all of the nodes inside the category. However, there is no requirement for a one-to-one correspondence between node categories and software images. Therefore, multiple node categories may use the same software image.

Example

By default, all nodes are placed in the slave category. Alternative categories can be created and used at will, such as:

Node Category   Description
nodes-ib        nodes with InfiniBand capabilities
nodes-highmem   nodes with extra memory
login           login nodes
.../cm/local/apps/cmd/etc/cmd.conf (see section 3.7.2, Appendix C), the default value of PowerOffPDUOutlet is false. It can be set to true on the head node, and CMDaemon restarted, to activate it.

With PowerOffPDUOutlet set to true, it means that CMDaemon, after receiving an IPMI-based power-off instruction for a node, and after powering off that node, also subsequently powers off the PDU port. Powering off the PDU port shuts down the BMC, which saves some additional power: typically a few watts per node. When multiple nodes share the same PDU port, the PDU port only powers off when all nodes served by that particular PDU port are powered off.

When a node has to be started up again, the sequence with PowerOffPDUOutlet set to true is:
1. A power-on instruction via cmsh or cmgui has CMDaemon power on the PDU port. This starts up the BMC in the node.
2. Once the BMC is up, an IPMI-based power-on instruction from CMDaemon powers on the node.

5.1.4 Custom Power Control
For a device which cannot be controlled through any of the standard existing power control options, it is possible to set a custom power management script. This is then invoked by the cluster management daemon on the head node whenever a power operation for the device is done. Power operations are described further in section 5.2.

Using custompowerscript
To set a custom power management script for a device, the powercontrol attribute i
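A custom power script must at minimum distinguish the requested power operation. The sketch below is hypothetical: the operation names and the way the operation reaches the handler (here as a function argument standing in for a command-line argument) are assumptions for illustration, since the actual calling convention is defined by the custompowerscript documentation.

```shell
# Hypothetical skeleton for a custom power handler. Operation names and
# calling convention are assumptions, not the documented interface.
power_op() {
  case "$1" in
    ON)     echo "powering on"  ;;  # vendor-specific power-on command goes here
    OFF)    echo "powering off" ;;  # vendor-specific power-off command goes here
    STATUS) echo "ON" ;;            # query the device and report its state
    *)      echo "unknown operation: $1" >&2; return 1 ;;
  esac
}
power_op STATUS    # ON
```

The real script would replace the echo placeholders with calls to whatever vendor tool actually switches the device.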
… Parameters
5.2 Power Operations
5.3 Monitoring Power

6 Node Provisioning
6.1 Provisioning Nodes
6.2 Software Images
6.3 Node Installer
6.4 Node States
6.5 Updating Running Nodes
6.6 Adding New Nodes
6.7 Troubleshooting The Node Boot Process

7 User Management
7.1 Managing Users And Groups With cmgui
7.2 Managing Users And Groups With cmsh
7.3 Using An External LDAP Server
7.4 Using Kerberos Authentication
7.5 Tokens And Profiles

8 Workload Management
8.1 Workload Managers Choices And Installation
8.2 Forcing Jobs To Run In A Workload Management System
8.3 Enabling, Disabling, And Monitoring Workload Managers
8.4 Configuring And Running Individual Workload Managers
8.5 Using cmgui With Workload Management
8.6 Using cmsh With Workload Management
8.7 Examples Of Workload Management Assignment

9 Software Image Management
9.1 Bright Cluster Manager RPM Packages
9.2 Installing & Upgrading…
…ariables that can be used, as well as configuration suggestions, are built-ins, not standalone scripts. Standalone scripts are located in /cm/local/apps/cmd/scripts/healthchecks/.

Metrics, Health Checks And Actions

H.2.2 Parameters For Health Checks
Health checks have the parameters indicated by the left column in the example below:

Example

[myheadnode->monitoring->healthchecks]% show cmsh
Parameter              Value
---------------------- ------------------------------------------
Class of healthcheck   internal
Command                /cm/local/apps/cmd/scripts/healthchecks/cmsh
Description            Checks whether the cmsh is available
Disabled               no
Extended environment   no
Name                   cmsh
Only when idle         no
Parameter permissions  optional
Sampling method        samplingonslave
State flapping count   7
Timeout                10
Valid for              slave,master,pdu,ethernet,myrinet,ib,racksensor

The parameters have the same meaning as for metrics, with the following exceptions due to inapplicability:

Parameter        Reason For Inapplicability
---------------- ----------------------------------------------------
class/prototype  only applies to metric collections
cumulative       only sensible for numeric values
measurementunit  only applies to numeric values
retrievalmethod  all health checks use CMDaemon internally for retrieval
maximum          only applies to numeric values
minimum          only applies to numeric values

The remaining parameters have meanings that can be looked up in section H.1.2.
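A standalone health check script simply reports a health state when it is run. As a minimal illustrative sketch (assuming, as is conventional for such scripts, that the state is written as PASS or FAIL on standard output; the 95% threshold and the check itself are arbitrary choices for this example, not a shipped health check), a check on root filesystem usage could look like:

```shell
#!/bin/sh
# Sketch of a standalone health check: prints PASS or FAIL on stdout.
# The (arbitrary) condition here is that the root filesystem is below 95% full.
used=$(df -P / | awk 'NR==2 {gsub(/%/, "", $5); print $5}')
if [ "$used" -lt 95 ]; then
    state=PASS
else
    state=FAIL
fi
echo "$state"
```

Placing such a script in the healthchecks directory makes it available for configuration in the same way as the built-ins.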
…arted or stopped using /etc/init.d/sgemaster.sge1 start|stop, or alternatively via qconf -km.

The sge_execd execution daemon, running on each compute node, accepts, manages, and returns the results of the jobs on the compute nodes. The daemon can be started or stopped via /etc/init.d/sgeexecd start|stop, or alternatively deregistered from qmaster via qconf -ke.

Queues in an error state are cleared with qmod -c <queue name>.

SGE can be configured and managed generally with the command line utility qconf, which is what most administrators become familiar with. A GUI alternative, qmon, is also provided.

SGE commands are listed below. The details of these are in the man page of the command and the SGE documentation.
• qalter: modify existing batch jobs
• qacct: show usage information from accounting data
• qconf: configure SGE
• qdel: delete batch jobs
• qhold: place a hold on batch jobs
• qhost: display compute node queues, states, jobs
• qlogin: start a login-based interactive session with a node
• qmake: distributed, parallel make utility
• qmod: suspend/enable queues and jobs
• qmon: configure SGE with an X11 GUI interface
• qping: check sge_qmaster and sge_execd status
• qquota: list resource quotas
• qresub: create new jobs by copying existing jobs
• qrdel: cancel advance reservations
• qrls: release batch jobs from a held state
• qrsh: start rsh…
…at can be performed on a device depend on the type of device. For example, it is possible to mount a new filesystem to a node, but not to an ethernet switch.

Every device that is present in the cluster management infrastructure has a device state associated with it. The table below describes the most important states for devices:

Device State  Description
------------- ----------------------------------------------
UP            device is reachable
DOWN          device is not reachable
CLOSED        device has been taken offline by administrator

There are a number of other states, which are described in detail in Chapter 6 on node provisioning.

The DOWN and CLOSED states have an important difference. In the case of DOWN, the device was intended to be available, but instead is down. In the case of CLOSED, the device is intentionally unavailable.

3.1.2 Software Images
A software image is a blueprint for the contents of the local file systems on an ordinary node. In practice, a software image is a directory on the head node containing a full Linux file system.

When an ordinary node boots, the node provisioning system sets up the node with a copy of the software image. Once the node is fully booted, it is possible to instruct the node to re-synchronize its local filesystems with the software image. This procedure can be used to distribute changes to the software image without rebooting nodes.

Software images can be changed using regular Linux tools and commands (such as rpm and chroot). More details on making ch…
…at can be set in the cluster. Some of these metrics are built-ins, such as CPUUser in the basic example of section 10.1. Other metrics are standalone scripts. New custom metrics can also be built and added as standalone commands or scripts.

Metrics can be manipulated and configured:
• The Save button saves as-yet-uncommitted changes made via the Add or Edit buttons.
• The Revert button discards unsaved edits made via the Edit button. The reversion goes back to the last save.
• The Remove button removes a selected metric from the list.

The remaining buttons, Edit and Add, open up options dialogs. These are now discussed.

Metrics: The Main Tab's Edit And Add Options
The Metrics tab of Figure 10.24 has Add and Edit buttons. The Add button opens up a dialog to add a new metric to the list, and the Edit button opens up a dialog to edit a selected metric from the list. Both dialogs have the following options (Figure 10.25):
• Name: the name of the metric

Figure 10.25: Metric Add/Edit dialog, with fields such as Name (CPUUser), Description (Total core usage in user mode per sec), Command, Command timeout (5), Parameter (Disallowed), Cumulative, Unit, Disabled, Only when idle, Sampling Method (Sampling on slave), Class (cpu), Retrieval Method (CMDaemon), State flapping count (7), Absolute range (0 to 0), and checkboxes for Slave Node metric, Master Node metric, Power distribution unit metric, and Myrinet Switch metric.
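Where no built-in metric exists, a standalone metric script can supply the sample. As a minimal sketch (assuming the usual convention that a metric script writes a single numeric value to standard output, which the monitoring framework then records; the choice of the 1-minute load average is an arbitrary example, and /proc/loadavg is Linux-specific), such a script might be:

```shell
#!/bin/sh
# Sketch of a standalone metric script: emits one numeric sample on stdout.
# Reads the 1-minute load average from /proc/loadavg (Linux-specific).
value=$(cut -d ' ' -f 1 /proc/loadavg)
echo "$value"
```

The script is then added as a new metric via the Add dialog, with its full path given in the Command field.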
An image update can be carried out by selecting the specific node, or specific category, from the resource tree. Then, within the Tasks tabbed pane that opens up, the Update node button is clicked (Figure 6.20). This opens up a dialog which has a dry-run checkbox, marked by default.

Figure 6.20: Updating A Running Node's Image With cmgui

The dry run can be reviewed by clicking on the Provisioning Log button further down the same tabbed pane. The update can then be done again, with the dry-run check mark off, to actually implement the update.

Updating an image via cmsh or cmgui automatically updates the provisioners first, if the provisioners have not been updated in the last 5 minutes.

There are two scripts associated with the imageupdate command that may run as part of its execution:
• The imageupdate_initialize script runs before the node image starts updating. If the imageupdate_initialize script exits with non-zero, then the image does not update.
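The imageupdate_initialize script can therefore be used as a guard: whatever check it performs, a non-zero exit status vetoes the update. A minimal sketch (the lock-file path below is a hypothetical example for illustration, not a Bright Cluster Manager convention) could be:

```shell
#!/bin/sh
# Sketch of an imageupdate_initialize guard: a non-zero exit status is
# assumed to abort the image update, as described in the text.
# /var/tmp/hold-imageupdate is a hypothetical lock file for this example.
if [ -e /var/tmp/hold-imageupdate ]; then
    rc=1     # veto the update
else
    rc=0     # allow the update to proceed
fi
echo "exit status would be: $rc"
# a real guard script would end with:  exit $rc
```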
…ath to keyfile> <verify|info>

[root@myheadnode ~]# cd /cm/local/apps/cmd/etc/
[root@myheadnode etc]# verify-license cert.pem cert.key info
========= Certificate Information ========
Version:              5.1
Edition:              Advanced
Common name:          Bright 5.1 Cluster
Organization:         Bright Computing
Organizational unit:  Development
Locality:             San Jose
State:                California
Country:              US
Serial:               2603
Starting date:        29 Jun 2010
Expiration date:      29 Nov 2010
MAC address:          Parca greater TORCLET T
Licensed nodes:       3
==========================================
[root@myheadnode etc]#

The verify option of verify-license checks the validity of the license:
• If the license is valid, then no output is produced and the utility exits with exit code 0.
• If the license is invalid, then output is produced indicating what is wrong. Messages such as these are then displayed:

If the license is old:
[root@myheadnode etc]# verify-license cert.pem cert.key verify
License has expired
License verification failed.

If the certificate is not from Bright Computing:
[root@myheadnode etc]# verify-license cert.pem cert.key verify
Invalid license: This certificate was not signed by Bright Computing
License verification failed.

4.1.3 Requesting And Installing A License Using A Product Key
Verifying License Attributes
It is important to verify that the license attributes are correct before proceeding with cluster configuration. In particular, the license date should be checked to make sure that the license has not expired.
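Because verify-license signals validity purely through its exit code, it is convenient to script around. The sketch below demonstrates the pattern, with `true` standing in for a successful verify-license invocation (substitute the real command and certificate paths):

```shell
#!/bin/sh
# Exit-code pattern for verify-license: exit code 0 means the license is valid.
# 'true' stands in here for:  verify-license cert.pem cert.key verify
verify_stub() { true; }

if verify_stub; then
    msg="license valid"
else
    msg="license invalid"
fi
echo "$msg"
```

A cron job or pre-configuration check can use the same pattern to refuse to proceed on an invalid license.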
…been carried out on the system, then a threshold called killallyesthreshold is already there, with an assigned action killallyes.

The properties of each threshold can be shown (some prompt text elided for layout purposes):

Example

[...metricconf]% thresholds
[...metricconf[CPUUser]]% thresholds
[...metricconf[CPUUser]->thresholds]% list
Name (key)            Bound  Severity
--------------------- ------ --------
killallyesthreshold   50     10
[...metricconf[CPUUser]->thresholds]% show killallyesthreshold
Parameter   Value
----------- -------------------
Actions     enter: killallyes
Bound       50
Name        killallyesthreshold
Severity    10
UpperBound  yes

The meanings of the parameters are explained in the GUI equivalent of the above example, in section 10.4.2, in the section labeled "Metric Configuration Thresholds Options". The object manipulation commands introduced in section 3.6.3 work as expected at this cmsh prompt level: add and remove will add and remove a threshold; set, get, and clear will set and get values for the parameters of each threshold; refresh and commit will revert and commit changes; use will use the specified threshold, making it the default for commands; validate, applied to the threshold, will check if the threshold object has sensible values; and append and removefrom will append an action to, and remove an action from, a specified threshold.

The append and removefrom commands correspond to the + and - widgets of cmgui in Figure 10.19, and work with parameters that can have multiple…
…available actions is displayed. A new action is added by entering the following values in the Add dialog (Figure 10.2):
• action name: killallyes
• description: kill all yes processes
• command: /cm/local/apps/cmd/scripts/actions/killallyes

The Save button adds the action killallyes to the list of possible actions, which means that the action can now be used throughout the monitoring framework.

Setting Up The Threshold Level For CPUUser On The Head Node(s)
Continuing on, the Metric Configuration tab is selected. Then, within the selection box options for Metric Configuration, All Master Nodes is selected, to confine the metrics being measured to the head node(s). The metric CPUUser, which is a measure of the user-mode CPU usage as a percentage, is selected. The Thresholds button is clicked on, to open a Thresholds dialog. Within the Thresholds dialog, the Add button is clicked, to open up a New Threshold dialog. Within the New Threshold dialog (Figure 10.3), these values are set:
• threshold name: killallyesthreshold
• upper bound: 50
• action name (first selection box in the action option): killallyes
• action state option (next selection box in the action option): Enter
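The killallyes script itself is not reproduced in this excerpt. A plausible minimal implementation (an assumption for illustration, not the shipped script; pkill is used here, and the script deliberately reports success even when no yes processes were found) would be:

```shell
#!/bin/sh
# Sketch of a killallyes action script: terminates all running 'yes'
# processes. pkill returns non-zero when nothing matched, which is not
# an error for an action, so that case is ignored.
pkill -x yes || :     # ignore "no process matched"
result=done
echo "killallyes: $result"
```

An action script of this kind is simply an executable on the head node; the monitoring framework runs it whenever the associated threshold is crossed.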
…bqueue]% list
Type    Name
------- --------
sge     all.q
torque  default
torque  longq
torque  shortq
[bright51->jobqueue]% scheduler torque
Working scheduler is torque
[bright51->jobqueue(torque)]% list
Type    Name
------- --------
torque  default
torque  longq
torque  shortq

jobqueue Mode In cmsh: Other Object Manipulation Commands
The usual object manipulation commands of section 3.6.3 work at the top level mode, as well as in the scheduler submode:

Example

[bright51->jobqueue]% list torque
Type    Name
------- --------
torque  default
torque  longq
torque  shortq
[bright51->jobqueue]% show torque longq
Parameter        Value
---------------- --------------------------------------
Maximal runtime  23:59:59
Minimal runtime  00:00:00
Queue type       Execution
Routes
Type             torque
name             longq
nodes            node001.cm.cluster node002.cm.cluster
[bright51->jobqueue]% get torque longq maximalruntime
23:59:59
[bright51->jobqueue]%
[bright51->jobqueue]% scheduler torque
Working scheduler is torque
[bright51->jobqueue(torque)]% list
Type    Name
------- --------
torque  default
torque  longq
torque  shortq
[bright51->jobqueue(torque)]% show longq
Parameter        Value
---------------- --------------------------------------
Maximal runtime  23:59:59
Minimal runtime  00:00:00
Queue type       Execution
Routes
Type             torque
name             longq
nodes            node001.cm.cluster node002.cm.cluster
[bright51->jobqueue(torque)]% get longq maximalruntime
23:59:59
[bright51->jobqueue(torque)]% u…
…cal /etc/fstab file. In addition to all the mount points defined in the drive layout, several extra mount points can be added. These extra mount points, such as NFS imports, /proc, /sys, and /dev/shm, can be defined both in the node's category and in the node configuration. From cmsh, the extra mount points can be managed from the fsmounts submode of the category or device mode.

6.3.10 Installing GRUB Bootloader
Optionally, the node installer installs a boot record on the local drive, if the installbootrecord property of the node configuration or node category is set. For this to work:
• network booting should have a lower priority in the BIOS of the node than hard drive booting
• the GRUB bootloader, with a boot record, must be installed in the MBR of the local drive, overwriting the default gPXE boot record

To do this in cmgui, the Install boot record checkbox must be ticked and saved in the node configuration, or in the node category. The cmsh equivalents are commands like:

cmsh -c "device use node001; set installbootrecord yes; commit"

or:

cmsh -c "category use slave; set installbootrecord yes; commit"

This ensures that the next boot is from GRUB on the hard drive, instead of a boot from the head node image via the network.

Simply unsetting the Install boot record setting and rebooting the node does not restore its gPXE booting. To restore its gPXE booting, it can…
…cally adds any NFS, Lustre, FUSE, PanFS, FhGFS, GlusterFS, and GPFS imported file systems on the node. If this were not done, all data on these filesystems would be wiped, since they are not part of the software image.

6.5.2 Updating Running Nodes: imageupdate
Using a defined excludelistupdate property, the imageupdate command of cmsh is used to start an update on a running node:

Example

[mycluster->device]% imageupdate -n node001
Performing dry run (use synclog command to review result, then pass -w to perform real update)
Tue Jan 11 12:13:33 2011 bright51: Provisioning started on node node001
[bright51->device]% imageupdate -n node001: image update in progress ...
[bright51->device]%
Tue Jan 11 12:13:44 2011 bright51: Provisioning completed on node node001

By default, the imageupdate command performs a dry run, which means no data on the node is actually written. Before passing the -w switch, it is recommended to analyze the rsync output using the synclog command (section 6.3.7).

If the user is now satisfied with the changes that are to be made, the imageupdate command is invoked again with the -w switch to implement them:

Example

[mycluster->device]% imageupdate -n node001 -w
Provisioning started on node node001
node001: image update in progress ...
[mycluster->device]% Provisioning completed on node node001

In cmgui, an image update…
…cate operation from within cert mode. Alternatively, it can be generated within cmgui by using the Add dialog of the Certificates tabbed pane within the Authentication resource.

Every cluster management operation requires the user's profile to have the relevant tokens for the operation. Profiles are handled with the profiles mode of cmsh, or from the Authorization resource of cmgui. The following default profiles are available:

Profile name  Default Tasks Allowed
------------- ---------------------
Admin         all tasks
Node          node-related
Readonly      view only
CMHealth      health-related

Custom profiles can be created to include a custom collection of capabilities in cmsh and cmgui. Cloning of profiles is also possible from cmsh.

7.5.1 Creating A New Certificate For cmsh Users
Creating a new certificate in cmsh is done from cert mode, using the createcertificate command, which has the following help text:

[bright51->cert]% help createcertificate

Usage: createcertificate <key-length> <common-name> <organization> <organizational-unit> <locality> <state> <country> <profile> <sys-login> <days> <key-file> <cert-file>

  key-file ........ Path to key file that will be generated
  cert-file ....... Path to pem file that will be generated

Accordingly, as an example, a certificate file with a read-only profile set t…
…accesses tasks that operate on the resource.
• Settings: allows configuration of properties of the resource.

Figure 3.5: Node Settings (the cmgui settings pane for a node, showing properties such as Hostname, MAC address, Category, Rack position, Install boot record, Install mode, Management network, Ethernet switch and port, Power control, Power Distribution Unit and port, Provisioning interface, Provisioning transport, and Finalize Script.)

For example, t…
…cconf]% exit; exit; exit
[myheadnode]% monitoring setup metricconf masternode
[myheadnode->monitoring->setup[MasterNode]->metricconf]% exit; exit
[myheadnode->monitoring->setup]% metricconf masternode
[myheadnode->monitoring->setup[MasterNode]->metricconf]% exit
[myheadnode->monitoring->setup[MasterNode]]% metricconf
[myheadnode->monitoring->setup[MasterNode]->metricconf]%

A list of metrics that have been set to do sampling for the device category masternode is obtained with list. Since there are many of these, only 10 lines are displayed in the list shown below, by piping it through head:

Example

[myheadnode->monitoring->setup[MasterNode]->metricconf]% list | head
Metric          Metric Param  Sampling interval
--------------- ------------- -----------------
AlertLevel      max           0
AlertLevel      sum           0
AvgExpFactor                  120
AvgJobDuration  all.q         60
BufferMemory                  120
BytesRecv       eth0          120
BytesRecv       eth1          120
BytesSent       eth0          120
BytesSent       eth1          120
CMDMemUsed                    120

Besides list, an alternative way to get a list of metrics that are set to sample for masternode is to use the tab-completion suggestions to the use command. The use command is normally used to drop into the configuration properties of the metric, so that parameters of the metric object can be configured:

Example

[myheadnode->monitoring->setup[MasterNode]->metricconf]% use cpuuser
[myheadnode->monitoring->setup[MasterNode]->metricconf[CPUUser…
Figure 10.2: cmgui Monitoring Configuration, Adding An Action. (The Actions tab lists the available actions, among them: Drain node: remove a node from further use by the workload manager; killprocess: kills processes with the pids found; Power off; Power on; Power reset; Reboot; SendEmail: send an email to the address specified; Shutdown; testaction; and Undrain node: enable a node to start running jobs. The dialog fields Name (killallyes), Description (kill all yes processes), and Command (/cm/local/apps/cmd/scripts/actions/killallyes) are filled in.)

Adding The Action To The Actions List
From the resources tree of cmgui, Monitoring Configuration is selected, and then the Actions tab is selected. A list of currently available…
…chapters:
• Chapter 4 explains how to configure and further set up the cluster after software installation of Bright Cluster Manager on the head node.
• Chapter 5 describes how power management within the cluster works.
• Chapter 6 explains node provisioning in detail.
• Chapter 7 explains how accounts for users and groups are managed.
• Chapter 8 explains how workload management is implemented and used.
• Chapter 9 demonstrates a number of techniques and tricks for working with software images and keeping images up to date.
• Chapter 10 explains how the monitoring features of Bright Cluster Manager can be used.
• Chapter 11 summarizes several useful tips and tricks for day-to-day monitoring.
• Chapter 12 describes a number of third-party software packages that play a role in Bright Cluster Manager.
• Chapter 13 gives details and setup instructions for high availability features provided by Bright Cluster Manager. These can be followed to build a cluster with redundant head nodes.

The appendices generally give supplementary details to the main text.

Installing Bright Cluster Manager
This chapter describes the installation of Bright Cluster Manager onto the head node of a cluster. Sections 2.1 and 2.2 list hardware requirements and supported hardware, while section 2.3 gives step-by-step instructions on installing Bright Cluster Manager from a DVD onto a head node.

2.1 Minimal Hardware Requirements
The following are minimal…
…check its power status information. In the display in Figure 5.4 for a head node, the green LEDs indicate that all three PDU ports are turned on. Red LEDs would indicate power ports that have been turned off, while gray LEDs would indicate an unknown power status for the device.

Performing power operations on multiple devices at once is possible through the Tasks tabs of node categories and node groups. It is also possible to do power operations on ad hoc groups through the Slave Nodes folder in the resource tree. The members of the ad hoc group can be selected using the Overview tab, and then operated on by a task chosen from the Tasks tab.

When doing a power operation on multiple devices, CMDaemon ensures that a 1-second delay occurs by default between every 2 successive devices.

(Figure: cmgui Overview tab for the head node mycluster, showing hostname, running jobs, load averages, uptime, memory, swap memory, and CPU usage.)
…click Continue.
7. Specify the number of racks and the number of regular nodes, set the base name for the regular nodes and the number of digits to append to the base name. Select the correct hardware manufacturer and click Continue.
8. Choose a network layout and click Continue. The first layout is the most commonly used. The rest of this appendix will assume the first layout was chosen.
9. Optionally, add an InfiniBand network, and configure the use of IPMI/iLO BMCs on the nodes. Adding an IPMI/iLO network is only necessary when the IPMI/iLO interfaces should be configured in a different IP subnet. When done, click Continue.
10. Fill in the following settings for the network named externalnet:
    • Base Address (a.k.a. network address)
    • Netmask
    • Domain name
    • Default gateway
    The network externalnet corresponds to the site network that the cluster resides in (e.g. a corporate or campus network). Note that assigning the cluster an IP address in this network will be handled in one of the next screens. Click Continue.
11. Add and remove DNS nameservers and DNS search domains, and click Continue.
12. Assign an IP address for the head node on externalnet. This is the IP address that will be used to access the cluster over the network.
13. If necessary, modify the node properties. When IPMI/iLO interfaces will reside in the same IP subnet, set an IP Offset for the ip…
…cluster:

[root@mycluster ~]# cmsh
[mycluster]% softwareimage
[mycluster->softwareimage]% clone default-image lustre-server-image
[mycluster->softwareimage]% commit

The RPM Lustre packages can be downloaded from the Lustre website. It is best to first check which version of Lustre can be used for a particular distribution against the Lustre Test Matrix, at the Lustre wiki at http://wiki.lustre.org/index.php/Lustre_Release_Information#Lustre_Test_Matrix

After choosing a Lustre version from the Lustre Test Matrix, the appropriate distribution and platform can be chosen. For CentOS and Scientific Linux (SL), RedHat packages can be used. To download the packages, an account is required.

The RPM packages to download are:
• kernel: Lustre-patched kernel (MDS/MGS/OSS only)
• kernel-modules: Lustre modules, client and server, for the Lustre-patched kernel
• lustre: Lustre userland tools, client and server, for the Lustre-patched kernel
• lustre-ldiskfs: backing filesystem kernel module (MDS/MGS/OSS only)
• e2fsprogs: backing filesystem creation and repair tools (MDS/MGS/OSS only)

In most cases the e2fsprogs distribution package is already installed, so the package has to be upgraded. It is possible that the Lustre e2fsprogs package conflicts with the e4fsprogs distribution package, in which case the e4fsprogs package has to be removed. If the Lustre kernel version has a lower version number than the already installed kernel…
…cluster, the privileged user run fails as a result of hardware differences. To resolve the failures, it is necessary to create multiple groups of homogeneous hardware. For more information, the Intel Cluster Checker documentation can be consulted.

12.4.5 Applying For Certificate
When both the regular user run and the privileged user run have reported that the Check has Succeeded, a certificate may be requested for the cluster. Requesting a certificate involves creating a Bill of Materials, and submitting it, along with the two output files of the two cluster checker runs, to cluster.intel.com. The Intel Cluster Ready site contains interactive submission forms that make the application process as easy as possible.

12.5 CUDA
In order to take advantage of the computational capabilities of NVIDIA GPUs that may be present in the nodes of a cluster, the optional CUDA packages should be installed.

12.5.1 Installing CUDA
A number of CUDA 3.2 packages exist in the YUM repository:

Package          Type    Description
---------------- ------- -------------------------------------
cuda32-toolkit   shared  CUDA 3.2 math libraries and utilities
cuda32-sdk       shared  CUDA 3.2 software development kit
cuda32-profiler  shared  CUDA 3.2 profiler
cuda32-driver    local   CUDA 3.2 driver
cuda32-libs      local   CUDA 3.2 libraries

The packages marked as shared in the table above should be installed on the head nodes of a cluster containing CUDA-compatible GPUs. The packages marked as local should be installed to…
…communication error. Removing the node's corresponding certificate directory allows the node installer to request a new certificate and proceed further.

6.3.2 Deciding, Or Selecting, Node Configuration
Once communication with the head node CMDaemon is established, the node installer tries to identify the node it is running on, so that it can select a configuration from CMDaemon's record for it, if any such record exists. It correlates any node configuration the node is expected to have, according to the network hardware detected. If there are issues during this correlation process, then the administrator is prompted to select a node configuration, until all nodes finally have a configuration.

Possible node configuration scenarios
The correlation process, and the corresponding scenarios, are now covered in more detail.

It starts with the node installer sending a query to CMDaemon, to check if the MAC address used for net-booting the node is already associated with a node in the records of CMDaemon. In particular, it checks the MAC address for a match against the existing node configuration properties, and decides whether the node is known or new:
• the node is known if the query matches a node configuration. It means that the node has been booted before.
• the node is new if no configuration is found.

In both cases, the node installer then asks CMDaemon to find out if the node is connected…
…considered by the Intel Cluster Checker. By default, only the head node and the first three nodes are included. However, for a full certification of the cluster, all nodes should be included.

Example

Rather than manually creating the nodelist file, the following command may be used to generate a nodes list consisting of master and node001 through node150:

[root@mycluster ~]# (echo "master  # type: head"
> for i in {1..150}
> do echo "node$(printf '%03d' $i)  # type: compute"
> done) > /home/cmsupport/intel-cluster-ready/nodelist

File Exclude List
The excludefiles file lists all files which should be skipped when scanning for differences between nodes. Modifying the excludefiles list is normally not necessary.

12.4.3 Generating Fingerprint Files
Before running the Intel Cluster Checker, a list of all packages installed on the head node and the regular nodes must be generated. These lists are used to ensure that the same software is available on all nodes. Generating the lists can be done by passing the packages flag to the cluster checker. The cluster checker uses the first node defined in the nodes file as the standard.

Two output files, with a time stamp in the filename, are produced by the cluster checker. The two output files must subsequently be copied to the following locations:

/home/cmsupport/intel-cluster-ready/head.package.list
/home/cmsupport/intel-cluster-ready/node.package.list

Example

module load shared in…
…core usage in user mode per second
Disabled               no
Extended environment   no
Measurement Unit
Name                   CPUUser
Only when idle         no
Parameter permissions  disallowed
Retrieval method       cmdaemon
Sampling method        samplingonslave
State flapping count   7
Timeout                5
Valid for              slave,master
maximum                <range not set>
minimum                <range not set>
[myheadnode->monitoring->metrics]%

The meanings of the parameters above are explained in Appendix H.1.2.

Tab-completion suggestions for the show command suggest arguments corresponding to names of objects (the names returned by the list command) that may be used in a monitoring mode. For metrics mode, show, followed by a double tap on the tab key, displays a large number of possible metrics objects:

Example

[myheadnode->monitoring->metrics]% show
Display all 122 possibilities? (y or n)
alertlevel      droprecv     ipoutrequests
avgexpfactor    dropsent     ipreasmoks
avgjobduration  errorsrecv   ipreasmreqds
await_sda       errorssent   loadfifteen
...

The get command returns the value of an individual parameter of a particular metric object:

Example

[myheadnode->monitoring->metrics]% get CPUUser description
Total core usage in user mode per second
H.1.2 Parameters For Metrics
Metrics have the parameters indicated by the left column in the following example:

Example

[myheadnode->monitoring->metrics]% show cpuuser
Parameter                Value
------------------------ ------------------------------------------
Class of metric          cpu
Command                  <built-in>
Cumulative               yes
Description              Total core usage in user mode, per second
Disabled                 no
Extended environment     no
Measurement Unit
Name                     CPUUser
Only when idle           no
Parameter permissions    disallowed
Retrieval method         cmdaemon
Sampling method          samplingonslave
State flapping count     7
Timeout                  5
Valid for                slave,master
maximum                  <range not set>
minimum                  <range not set>
[myheadnode->monitoring->metrics]%

The meanings of the parameters are:

Class of metric: a choice assigned to a metric, depending on its type. The choices, and what they are related to, are listed below:
  Misc: default, miscellaneous class of metrics, used if none of the other classes are appropriate, or if none of the other classes are chosen
  CPU: CPU activity
  GPU: GPU activity
  Disk: disk activity
  Memory: memory activity
  Network: network activity
  Environmental: sensor measurements of the physical environment
  Operating System: operating system activity
  Internal: Bright Cluster Manager utilities
  Workload: workload management
  Cluster: cluster-wide measurements
  Prototype: metric collections class

Command: for a standalone metric script, it is the full path. For a built-in
configured. However, the sample_ipmi script in this case simply returns 0, and no output. The rationale here being that the administrator is aware of this situation, and would not expect data from that IPMI, let alone an error.

I.5 Environment Variables
The following environment variables are available for a metric collection script, as well as for custom scripts.

On all devices: CMD_HOSTNAME: name of the device. For example:
  CMD_HOSTNAME=myheadnode

Only on non-node devices: CMD_IP: IP address of the device. For example:
  CMD_IP=192.168.1.33

Only on node devices: because these devices generally have multiple interfaces, the single environment variable CMD_IP is often not enough to express these. Multiple interfaces are therefore represented by these environment variables:

• CMD_INTERFACES: list of names of the interfaces attached to the node. For example:
  CMD_INTERFACES=eth0 eth1 ipmi0 BOOTIF

• CMD_INTERFACE_<interface>_IP: IP address of the interface with the name <interface>. For example:
  CMD_INTERFACE_eth0_IP=10.141.255.254
  CMD_INTERFACE_eth1_IP=0.0.0.0

• CMD_INTERFACE_<interface>_TYPE: type of interface with the name <interface>. For example:
  CMD_INTERFACE_eth1_TYPE=NetworkPhysicalInterface
  CMD_INTERFACE_ipmi0_TYPE=NetworkIpmiInterface

  Possible values are:
  NetworkIpmiInterface
  NetworkPhysicalInterface
  NetworkVLANIn
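As a sketch of how a custom script might consume these variables (the variable values below are made-up examples set by hand; on a real node CMDaemon would export them before calling the script), each interface name can be resolved to its address with indirect expansion:

```shell
#!/bin/bash
# Hypothetical custom script: print each interface of a node together
# with its IP address, using the CMD_* variables described above.
# The values here are illustrative only.
export CMD_HOSTNAME=node001
export CMD_INTERFACES="eth0 eth1"
export CMD_INTERFACE_eth0_IP=10.141.0.1
export CMD_INTERFACE_eth1_IP=0.0.0.0

for ifname in $CMD_INTERFACES; do
    # Indirect expansion resolves CMD_INTERFACE_<name>_IP at runtime
    var="CMD_INTERFACE_${ifname}_IP"
    echo "$CMD_HOSTNAME $ifname ${!var}"
done
```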
Table J.1: External Network Parameters And How To Change Them On The Head Node

(eth1 interface IP): IP address of head node on the eth1 interface
  view: cmsh -c "device interfaces master; get eth1 ip"
  set:  cmsh -c "device interfaces master; set eth1 ip <address>; commit"

baseaddress: base IP address (network address) of the network
  view: cmsh -c "network; get externalnet baseaddress"
  set:  cmsh -c "network; set externalnet baseaddress <address>; commit"

broadcastaddress: broadcast IP address of the network
  view: cmsh -c "network; get externalnet broadcastaddress"
  set:  cmsh -c "network; set externalnet broadcastaddress <address>; commit"

netmaskbits: netmask in CIDR notation (the number after the /, or prefix length)
  view: cmsh -c "network; get externalnet netmaskbits"
  set:  cmsh -c "network; set externalnet netmaskbits <bitsize>; commit"

gateway: default route IP address
  view: cmsh -c "network; get externalnet gateway"
  set:  cmsh -c "network; set externalnet gateway <address>; commit"

nameservers: nameserver IP addresses
  view: cmsh -c "partition; get base nameservers"
  set:  cmsh -c "partition; set base nameservers <address>; commit"

searchdomains: names of search domains
  view: cmsh -c "partition; get base searchdomains"
  set:  cmsh -c "partition; set base searchdomains <hostname>; commit"

timeservers: time server hostnames
  view: cmsh -c "partition; get base timeservers"
and clicking the button. A timezone can be selected from the drop-down box if the default is incorrect. Clicking Continue leads to the Authentication screen, described next.

Figure 2.20: Time Configuration

Authentication
The Authentication screen (figure 2.21) requires the password to be set twice for the cluster administrator. The hostname of the head node can also be modified in this screen. Clicking Continue validates the passwords that have been entered and, if successful, leads to the Console screen, described next.

Figure 2.21: Authentication

Console
The Console screen (figure 2.22) allows selection of a graphical mode or a text console mode for when the head node or ordinary nodes boot. Clicking Continue leads to the Summary screen, described next.

Figure 2.22: Console

Summary
The Summary
The add command is used to create the new health check object:

Example

[root@myheadnode ~]# cmsh
[myheadnode]% monitoring healthchecks
[myheadnode->monitoring->healthchecks]% add cpucheck
[myheadnode->monitoring->healthchecks[cpucheck]]%

The set command sets the value of each parameter displayed by a show command (some prompt text elided for layout purposes):

Example

...% set command /cm/local/apps/cmd/scripts/healthchecks/cpucheck
...% set description "CPUUser under 50%"
...% set parameterpermissions disallowed
...% set samplingmethod samplingonmaster
...% set validfor master
...% commit

Since the cpucheck script does not yet exist in the location given by the parameter command, it needs to be created:

#!/bin/bash
## echo PASS if CPUUser < 50%
## cpu is a %, i.e. between 0 and 100
cpu=$(mpstat 1 1 | tail -1 | awk '{print $3}')
comparisonstring="$cpu < 50"
if [ $(bc <<< "$comparisonstring") -eq 1 ]; then
    echo PASS
else
    echo FAIL
fi

The script should be placed in the location suggested by the object, /cm/local/apps/cmd/scripts/healthchecks/cpucheck, and made executable with a chmod 700. The cpucheck object is handled further within the cmsh monitoring setup mode in section 10.7.4 to produce a fully configured health check.

10.7.3 cmsh Monitoring: Metrics
The monitoring metrics mode of cmsh corr
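Where mpstat (from the sysstat package) is not available, a variant of the same check can derive the user-mode percentage from /proc/stat instead. This is a sketch, not the script shipped with the product, and it measures usage since boot rather than over a one-second interval:

```shell
#!/bin/bash
## Variant of cpucheck: PASS if user-mode CPU time since boot is under 50%.
## Reads the aggregate "cpu" line of /proc/stat instead of calling mpstat.
read -r _ user nice system idle _ < /proc/stat
total=$((user + nice + system + idle))
cpu=$(awk -v u="$user" -v t="$total" 'BEGIN { printf "%.1f", 100 * u / t }')
if awk -v c="$cpu" 'BEGIN { exit !(c < 50) }'; then
    echo PASS
else
    echo FAIL
fi
```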
carried out over the SSL port.

SSLPort directive
Syntax: SSLPort number
Default: SSLPort 8081

The number used in the syntax above is a number between 0 and 65535. The standard port is 8081. The SSLPort directive controls the SSL port that the cluster management daemon listens on.

SSLPortOnly directive
Syntax: SSLPortOnly yes|no
Default: SSLPortOnly no

The SSLPortOnly directive allows the non-SSL port to be disabled. Normally both SSL and non-SSL ports are active, although in practice only the SSL port is used.

CertificateFile directive
Syntax: CertificateFile filename
Default: CertificateFile /cm/local/apps/cmd/etc/cmd.pem

The CertificateFile directive specifies the certificate which is to be used for authentication purposes. On the master node, the certificate used also serves as a software license.

PrivateKeyFile directive
Syntax: PrivateKeyFile filename
Default: PrivateKeyFile /cm/local/apps/cmd/etc/cmd.key

The PrivateKeyFile directive specifies the private key which corresponds to the certificate that is being used.

CACertificateFile directive
Syntax: CACertificateFile filename
Default: CACertificateFile /cm/local/apps/cmd/etc/cacert.pem

The CACertificateFile directive specifies the path to the Bright Cluster Manager root certificate. It is normally not necessary to change the root certificate.

RandomSeedFile d
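Putting these directives together, a fragment of the CMDaemon configuration file might read as follows. This is an illustrative sketch: the port value 8082 is arbitrary, and the exact value syntax (quoting, use of "=") should be checked against the shipped configuration file before editing:

```
# Listen on a non-default SSL port and disable the plain (non-SSL) port.
SSLPort 8082
SSLPortOnly yes

# Certificate, key, and CA certificate (the documented defaults).
CertificateFile /cm/local/apps/cmd/etc/cmd.pem
PrivateKeyFile /cm/local/apps/cmd/etc/cmd.key
CACertificateFile /cm/local/apps/cmd/etc/cacert.pem
```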
[root@myheadnode ~]# cmsh
[myheadnode]% monitoring setup
[myheadnode->monitoring->setup]% list
Category               Metric configuration  Health configuration
---------------------- --------------------- ---------------------
Chassis                <2 in submode>        <1 in submode>
EthernetSwitch         <13 in submode>       <1 in submode>
GenericDevice          <2 in submode>        <1 in submode>
GpuUnit                <3 in submode>        <0 in submode>
IBSwitch               <13 in submode>       <1 in submode>
MasterNode             <88 in submode>       <9 in submode>
MyrinetSwitch          <0 in submode>        <0 in submode>
PowerDistributionUnit  <5 in submode>        <1 in submode>
RackSensor             <2 in submode>        <1 in submode>
slave                  <25 in submode>       <5 in submode>
[myheadnode->monitoring->setup]%

A device category must always be used when handling the properties of the metrics and health checks configurable under the submodes of monitoring setup. The syntax of a configuration submode, metricconf or healthconf, therefore requires the device category as a mandatory argument, and tab-completion suggestions become quite helpful at this point.

Examples are now given of how the metric configuration (metricconf) and health check configuration (healthconf) submodes are used.

cmsh monitoring setup: metricconf
Continuing with the session above, the metricconf option can only be used with a device category specified. Tab-completion suggestions for metricconf suggest
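For instance, descending into the metric configuration of the slave category might look like the following hypothetical session (the category name is taken from the list output above; the exact prompt rendering may differ):

```
[myheadnode->monitoring->setup]% metricconf slave
[myheadnode->monitoring->setup[slave]->metricconf]% list
```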
node. Chapter 5 has more information on power settings and operations.

Example

[mycluster->device]% use node001
[mycluster->device[node001]]% get powerdistributionunits
apc01:1
[mycluster->device[node001]]% append powerdistributionunits apc01:5
[mycluster->device[node001]]% get powerdistributionunits
apc01:1 apc01:5
[mycluster->device[node001]]% append powerdistributionunits apc01:6
[mycluster->device[node001]]% get powerdistributionunits
apc01:1 apc01:5 apc01:6
[mycluster->device[node001]]% removefrom powerdistributionunits apc01:5
[mycluster->device[node001]]% get powerdistributionunits
apc01:1 apc01:6
[mycluster->device[node001]]% set powerdistributionunits apc01:1 apc01:2
[mycluster->device[node001]]% get powerdistributionunits
apc01:1 apc01:2
[mycluster->device[node001]]%

Working With Objects: usedby
Removing a specific object is only possible if other objects do not have references to it. To help the administrator discover a list of objects that depend on (use) the specified object, the usedby command may be used. In the following example, objects depending on device apc01 are requested. The usedby property of powerdistributionunits indicates that device objects node001 and node002 contain references to (use) the object apc01. In addition, the apc01 device is itself displayed as being in the up state, indicating a dependency of apc01 on itself. If the device is to
node, or performing an HA setup. This section will describe how to create an HA setup using the cmha-setup utility, which was specifically created for guiding the process of building an HA setup. During the process of setting up HA, the cmha-setup utility will interact with the cluster management environment using cmsh to create the setup. Although it is also possible to create an HA setup manually, using either cmgui or cmsh, this approach is not recommended as it is error-prone.

Globally, the process of creating an HA setup involves three stages:

• Preparation: setting up configuration parameters for the shared interface and for the secondary head node that is about to be installed.

• Cloning: installing the secondary head node is done by creating a clone of the primary head node.

• Shared Storage Setup: setting up the method for shared storage.

13.2.1 Preparation
The following steps will prepare for the cloning of a new head node:

0. Power off all slave nodes.

1. To start the HA setup, run the cmha-setup command from a root shell on the primary head node, and choose Setup Failover in the main menu.

2. Enter the MySQL root password. On a new installation this is the administrator password that was configured during cluster installation.

3. Configure parameters for the virtual shared internal IP address. By selecting Create, the shared interface will be created.

4. Configure
loaded after the kernel during early boot, and contains driver modules for the node's network card and local storage.

In cmsh, the modules that are to go on the ramdisk can be placed using the kernelmodules submode of the softwareimage mode. The order in which they are listed is the attempted load order.

Whenever a change is made via the kernelmodules submode to the kernel module selection of a software image, CMDaemon automatically runs the createramdisk command. The createramdisk command regenerates the ramdisk inside the image and sends the updated image to all provisioning nodes.

The createramdisk command can also be run from cmsh manually at any time by the administrator when in softwareimage mode, which is useful if a kernel or modules build is done without using CMDaemon.

In cmgui, the selection of kernel modules is done by selecting the Software Images resource, and then choosing the Kernel Config tabbed pane (figure 6.4).

Figure 6.4: cmgui: Selecting Kernel Modules For Node Images

The order of module loading can be rearranged by selecting a module
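A hypothetical cmsh session following the description above might look like this (the module name e1000e and the image name default-image are arbitrary examples, and the exact prompt rendering may differ):

```
[mycluster]% softwareimage use default-image
[mycluster->softwareimage[default-image]]% kernelmodules
[mycluster->softwareimage[default-image]->kernelmodules]% add e1000e
[mycluster->softwareimage[default-image]->kernelmodules]% commit
```

After the commit, CMDaemon regenerates the ramdisk with createramdisk and sends the updated image to the provisioning nodes, as described above.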
Figure 2.5: Hardware Overview Based On Loaded Kernel Modules

Nodes Configuration
The Nodes screen (figure 2.6) configures the number of racks, the number of nodes, the node basename, the number of digits for nodes, and the hardware manufacturer. The maximum number of digits is 5, to keep the hostname reasonably readable.

The Node Hardware Manufacturer selection option initializes any monitoring parameters relevant for that manufacturer's hardware. If the manufacturer is not known, then Other is selected from the list.

Clicking Continue in this screen leads to the Network Architecture selection screen, described next.

Figure 2.6: Nodes Configuration

Network Architecture
The Network Architecture screen allows selection of one of three different network architecture setups:

• A type 1 network (figure 2.7), with nodes connected on a private internal network. It is the default network setup.

• A type 2 network (figure 2.8), with nodes connected on a public network.

• A type 3 network (figure 2.9), with nodes connected on a routed public network.

Selecting the network architecture helps decide the predefined networks on the Networks settings screen later (figure 2.11). Clicking Continue here
procedure. Once a guarantee has been obtained that the active head node is powered off, the fencing head node (i.e. the previously passive head node) moves to active mode.

13.1.6 Quorum
A problem remains in situations where the passive head node loses its connectivity to the active head node, but the active head node is doing fine, communicating with the entire cluster. In that case there is no reason to initiate a failover. It could even result in undesirable situations where the cluster is rendered unusable, because a passive head node might decide to power down an active head node merely because the passive head node is unable to communicate with the outside world (except for the PDU feeding the active head node).

To prevent a passive head node from powering off an active head node unnecessarily, the passive head node will first initiate a quorum by contacting all nodes in the cluster. The nodes will be asked to confirm that they also cannot communicate with the active head node. If more than half of the total number of slave nodes confirm that they are also unable to communicate with the active head node, the passive head node will initiate the STONITH procedure and move to active mode.

13.1.7 Automatic vs Manual Failover
Administrators have a choice between creating an HA setup with automatic or manual failover. In case of automatic failover, an active head
Name         Netmask bits  Base address    Broadcast address
-----------  ------------  --------------  -----------------
externalnet  29            195.73.194.136  195.73.194.143
internalnet  16            10.142.0.0      10.142.255.255
[mycluster->network]%

In the above example, while in network mode, the status command is executed in device mode and passed the argument node001, making it display the status of the node001 device. The list command on the same line, after the semi-colon, runs as expected in network mode, to display a list of network objects.

3.6.3 Working With Objects
Modes in cmsh work with associated objects. For instance, device mode works with device objects, and network mode works with network objects. The commands used to deal with objects are the same in all modes:

Command   Description
--------  -----------------------------------------------------------
use       Use the specified object, i.e. make the specified object the
          current object
add       Create the object and use it
clone     Clone the object and use it
remove    Remove the object
commit    Commit local changes done to an object to the cluster
          management infrastructure
refresh   Undo local changes done to the object
list      List all objects at current level
format    Set formatting preferences for list output
show      Display all properties of the object
get       Display specified property of the object
set       Set a specified property of the object
clear     Set empty value for a specified property of the object. If no
          property is specified, clear every va
be contacted.

YUM uses caches to speed up its operations. Occasionally these caches may need flushing to make YUM fetch fresh copies of all index files associated with a repository. This is done with:

yum clean all

As an extra protection to prevent Bright Cluster Manager installations from receiving malicious updates, all Bright Cluster Manager packages are signed with the Bright Computing GPG public key (0x5D849C16), installed by default in /etc/pki/rpm-gpg/RPM-GPG-KEY-cm. The Bright Computing public key is also listed in Appendix B.

The first time YUM is used to install updates, the user is asked whether the Bright Computing public key should be imported into the local RPM database. Before answering with a "Y", the administrator may choose to compare the contents of /etc/pki/rpm-gpg/RPM-GPG-KEY-cm with the key listed in Appendix B, to verify its integrity. Alternatively, the key may also be imported into the local RPM database directly; the following command is used:

rpm --import /etc/pki/rpm-gpg/RPM-GPG-KEY-cm

9.3 Managing Packages Inside Images
Installing or updating packages inside a node image can be handled with rpm or yum. The rpm command supports the --root flag. To install an RPM inside the default node image, the following command is used:

rpm --root /cm/images/default-image -ivh /tmp/libxml2-2.6.16-6.x86_64.rpm

Similarly, YUM uses the --installroot flag. For example, all packages in the image are updated with:

yu
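A sketch of the update invocation just described, using the default image path from this section and the --installroot flag mentioned above:

```
yum --installroot=/cm/images/default-image update
```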
issued to device sda  no  0  100
metric await_sda ms internal The average time (in milliseconds) for I/O requests issued to device sda to be served  no  0  500

I.3 Metric Collections Output During Regular Use
The output of a metric collection script without a flag is a list of outputs from the available metrics. The format of each line in the list is:

metric <name> <value>

where the parameters are:

metric: a bare word
name: the name of the metric
value: the numeric value of the measurement

Example

[root@myheadnode metrics]# ./sample_responsiveness
metric await_sda 0.00
metric util_sda 0.00
[root@myheadnode metrics]#

If the output has more metrics than suggested when the --initialize flag is used, then the extra sampled data is discarded. If the output has fewer metrics, then the missing metrics are set to NaN (not a number) for the sample.

I.4 Error Handling
As long as the exit code of the script is 0, the framework assumes that there is no error. So, with the --initialize flag active, despite no numeric value output, the script does not exit with an error.

If the exit code of the script is non-zero, the output of the script is assumed to be a diagnostic message and passed to the head node. This in turn will be shown as an event in cmsh or cmgui.

For example, the sample_ipmi script uses the ipmi-sensors binary internally. Calling the binary directly returns an error code if the device has no IPMI configure
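A minimal metric collection script following this output format might look as follows. It is a sketch, not one of the shipped scripts: the load_one metric name is made up for illustration, and the --initialize field layout mimics the listing shown above:

```shell
#!/bin/bash
# Hypothetical metric collection script (illustrative only):
# reports the 1-minute load average as a single metric.

if [ "$1" = "--initialize" ]; then
    # Announce the metric that this collection provides.
    echo "metric load_one '' internal 1-minute load average no 0 100"
    exit 0
fi

# Regular sampling: one "metric <name> <value>" line per metric.
# Falls back to 0.00 on systems without /proc/loadavg.
value=$(cut -d' ' -f1 /proc/loadavg 2>/dev/null || echo "0.00")
echo "metric load_one $value"
```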
leads to the Additional Network Configuration screen, described next.

Figure 2.7: Network Architecture: nodes connected on a private internal network

Figure 2.8: Network Architecture: nodes connected on a public network

Figure 2.9: Network Architecture: nodes connected on a routed public network

Additional Network Config
the new nodes. The creation of the node objects means that the node names exist, and so assignment to the node names is able to take place. An easy way to create nodes, set their provisioning interface, and set their IP addresses is described in the section on the node creation wizard (section 6.6.2). The nodes are also assumed to be set for net booting.

The physical nodes are then powered up in an arranged order. Because they are unknown new nodes, the node-installer keeps looping after a timeout. The head node in the meantime detects the new MAC addresses and switch ports in the sequence in which they first have come up, and lists them in that order.

By default, all these newly detected nodes are set to auto, which means their numbering goes up sequentially from whatever number is assigned to the preceding node in the list. Thus, if there are 10 new unassigned nodes that are brought into the cluster, and the first node in the list is assigned to the first available number, say node327, then clicking on assign automatically assigns the remaining nodes to the next available numbers, say node328 to node337.

After the assignment, the node-installer looping process on the new nodes notices that the nodes are now known. The node-installer then breaks out of the loop, and installation goes ahead without any intervention needed at the node console.

6.3.3 Starting Up All Network Interfaces
At the end of section 6.3.2, the node-installer knows which node i
the passive head node. For example:

• RPM installations/updates

• Applications installed locally

• Configuration file changes outside of the filesystems that are shared

It is also useful to realize that when the shared storage setup was made, the contents of the shared directories (at that time) were copied from the local filesystem to the newly created shared filesystems. The shared filesystems were then mounted over the mountpoints, effectively hiding the local contents. Since the shared filesystems are only mounted on the active machine, it is normal that the old data is still visible when a head node is operating in passive mode. This is not harmful, but may surprise users logging in to the passive head node. For this reason, logging in to a passive head node is not recommended for end-users.

13.3.4 High Availability Parameters
There are several HA-related parameters that can be tuned. In the cluster management GUI this can be done through the Failover tab, while selecting the cluster in the resource tree. In cmsh the settings can be accessed in the failover sub-mode of the base partition.

Example

[mycluster1]% partition failover base
[mycluster1->partition[base]->failover]% show
Parameter          Value
------------------ ------------------------------------------------
Dead time          10
Failover network   failovernet
Init dead          30
Keep alive         1
Mount script       /cm/local/apps/cmd/scripts/drbd-mount.sh
Quorum time        60
Secondary master   mycluster2
Figure 6.3: cmgui: A Button To Update Provisioning Nodes

In examples in section 6.1.2, changes were made to provisioning role attributes for an individual node as well as for a category of nodes. The updateprovisioners command should be run after changing provisioning role settings, to update images from the head node image to the provisioners according to the role settings changes, and to update provisioning role changes.

The updateprovisioners command also runs automatically in two other cases where CMDaemon is involved: during software image changes, and during a provision request. If, on the other hand, the software image is changed outside of the CMDaemon front-ends cmgui and cmsh, for example by an administrator adding a file by copying it into place from the bash prompt, then updateprovisioners should be run manually. In any case, if it is not run during one of the above times, there is also a scheduled time for it to run, to ensure that it runs at least once every 24 hours.

The updateprovisioners command is in all cases subject to safeguards that prevent it running too often in a short period (Appendix C
the top level, and therefore, to perform cluster management functions, a user switches and descends into the appropriate mode.

Figure 3.9 shows the top-level commands available in cmsh. These commands are displayed when help is typed in at the top level of cmsh:

connect ........................ Connect to cluster
disconnect ..................... Disconnect from cluster
alias .......................... Set aliases
unalias ........................ Unset aliases
exit ........................... Exit from current object or mode
quit ........................... Quit shell
export ......................... Display list of aliases, current list formats
help ........................... Display this help
history ........................ Display command history
list ........................... List state for all modes
modified ....................... List modified objects
refresh ........................ Refresh all modes
events ......................... Manage events
run ............................ Execute cmsh commands from specified file
category ....................... Enter category mode
cert ........................... Enter cert mode
device ......................... Enter device mode
jobqueue ....................... Enter jobqueue mode
jobs ........................... Enter jobs mode
ma
the user just added, and the prompt shows the user name to reflect this. Going into user context would otherwise be done manually by typing use user maureen at the user mode level.

Asterisks in the prompt are a helpful reminder of a modified state, with each asterisk indicating that there is an unsaved, modified property at that asterisk's level.

The modified command displays a list of modified objects, and corresponds roughly to the functionality of the List of Changes menu option under the View menu of the main menu bar.

Running show at this point reveals a user name entry, but empty fields for the other properties of user maureen. So the account in preparation, while it is modified, is clearly not yet ready for use:

Example

[mycluster->user*[maureen*]]% show
Parameter        Value
---------------- ---------------------------
Common name
Group ID
Home directory
Login shell
Password         < not set >
User ID
User name        maureen

7.2.2 Saving The Modified State
This corresponds roughly to the functionality of the Save button operation in section 7.1.

In section 7.2.1 above, user maureen was added. maureen now exists as a proposed modification, but has not yet been committed to the LDAP database. Running the commit command now at the maureen prompt will store the modified state at the user maureen level:

Example

[mycluster->user*[maureen*]]% commit
[mycluster->user[maureen]]% show
package-version-revision_cmx.y.architecture.rpm

where:

• package (mpich-ge-gcc-64) is the name of the package

• version (1.2.7) is the version number of the package

• revision (40) is the revision number of the package

• x.y (5.1) is the version of Bright Cluster Manager for which the RPM was built

• architecture (x86_64) is the architecture for which the RPM was built

More information about the RPM Package Manager is available at http://www.rpm.org.

9.2 Installing & Upgrading Packages
Once Bright Cluster Manager has been installed, Bright Cluster Manager software packages can be fetched, installed and upgraded by fetching, installing and upgrading the corresponding RPM packages using the rpm command-line utility. However, a more convenient way of managing packages is to use the YUM tool. For example, the following command lists all available packages:

yum list

The following command installs a new package:

yum install packagename

All installed packages are updated with:

yum update

Bright Computing maintains YUM repositories at http://updates.brightcomputing.com/yum, and updates are fetched by YUM in Bright Cluster Manager from there by default. Accessing the YUM repositories manually (i.e. not through YUM) requires a username and password. Authentication credentials are provided upon request. For more information on this, support@brightcomputing.com should be
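The naming convention above can be checked mechanically. The following sketch splits an RPM filename into its fields using plain shell parameter expansion, with the example name assembled from the components listed in this section:

```shell
# Split a Bright RPM filename of the form
#   package-version-revision_cmX.Y.architecture.rpm
# into its components, using the example values from this section.
f="mpich-ge-gcc-64-1.2.7-40_cm5.1.x86_64.rpm"

base="${f%.rpm}"               # strip the .rpm suffix
arch="${base##*.}"             # last dot-separated field: architecture
rest="${base%.$arch}"          # mpich-ge-gcc-64-1.2.7-40_cm5.1
cmver="${rest##*_cm}"          # Bright Cluster Manager version: 5.1
rest="${rest%_cm$cmver}"       # mpich-ge-gcc-64-1.2.7-40
revision="${rest##*-}"         # 40
rest="${rest%-$revision}"      # mpich-ge-gcc-64-1.2.7
version="${rest##*-}"          # 1.2.7
package="${rest%-$version}"    # mpich-ge-gcc-64

echo "$package $version $revision $cmver $arch"
```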
section 8.3.3. Selecting the Bright Cluster Manager workload manager item from the resources tree displays tabs that let a cluster administrator change the states of:

• jobs

• queues

• nodes

These tabs are described next.

8.5.1 Jobs Display And Handling In cmgui
Selecting the Jobs tab displays a list of job IDs, along with the scheduler, user, queue, and status of the job (figure 8.5).

Figure 8.5: Workload Manager Jobs

Within the tabbed pane:

• The Show button allows further details of a selected job to be listed.

• The Remove button removes selected jobs from the queue.

• The Hold button stops selected queued jobs from being considered for running, by putting them in a Hold state.

• The Release button releases selected queued jobs in the Hold state, so that they are considered for running again.

• The Suspend button suspends s
Recv measures the number of bytes received on an interface, and takes an interface name such as eth0 or eth1 as a parameter. For CPUUser the parameter field is disallowed in the Metric tab, so values here are ignored.

• Log length: the maximum number of raw data samples that are stored for the metric (3000 by default)

• Sampling interval: the time between samples (120s by default)

• Gap size: the number of missing samples allowed before a null value is stored as a sample value (2 by default)

• Threshold duration: the number of samples in the threshold zone before a threshold event is decided to have occurred (1 by default)

• Options checkboxes:

  - Store: if ticked, the metric data values are saved to the database. Note that any threshold checks are still done, whether the samples are stored or not.

  - Disabled: if ticked, the metric script does not run, and no threshold checks are done for it. If Store is also ticked, no value is stored.

  - Only when idle: if ticked, the metric script is only run when the system is idling. A resource-hungry metric will burden the system less this way.
ed, for example slave to lustre-server. The software image is set to the Lustre server image, the installbootrecord option is enabled, and the roles option is cleared.

Example

[root@mycluster ~]# cmsh
[mycluster]% category
[mycluster->category]% clone slave lustre-server
[mycluster->category[lustre-server]]% commit
[mycluster->category[lustre-server]]% set softwareimage lustre-server-image
[mycluster->category[lustre-server]]% set installbootrecord yes
[mycluster->category[lustre-server]]% clear roles
[mycluster->category[lustre-server]]% commit

Creating Lustre Server Nodes
An MDS node is created with cmsh:

Example

[root@mycluster ~]# cmsh
[mycluster]% device
[mycluster->device]% add slavenode mds001 10.141.16.1
[mycluster->device[mds001]]% set category lustre-server
[mycluster->device[mds001]]% commit

One or multiple OSS node(s) are created with cmsh:

Example

[root@mycluster ~]# cmsh
[mycluster]% device
[mycluster->device]% add slavenode oss001 10.141.32.1
[mycluster->device[oss001]]% set category lustre-server
[mycluster->device[oss001]]% commit

After the first boot and initial installation, the MDS and OSS(s) are configured to boot from the local drive instead of the network, to preserve locally made changes.

Creating The Lustre Metadata Target
On the metadata server, a metada
ed for diskless operation. In diskless mode, all data from the software image is transferred into the node's memory by the node-installer. The obvious advantage is the elimination of the physical disk, cutting power consumption and reducing the chance of hardware failure. On the other hand, some of the node's memory is no longer available for user applications.

By default, the amount of memory used for holding all filesystem data is unlimited. This means that creating very large files could cause a node to run out of memory and crash. If required, the maximum amount of memory used for the filesystem can be limited. This is done by setting a maximum with the maxMemSize attribute. The default value of 0 results in no limitations for the filesystem. Note that setting a limit will not necessarily prevent the node from crashing, as some processes might not deal properly with situations where there is no more free space on the filesystem.

<?xml version="1.0" encoding="UTF-8"?>
<diskSetup xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
           xsi:noNamespaceSchemaLocation="schema.xsd">
  <diskless maxMemSize="0"></diskless>
</diskSetup>

D.8 Example: Semi-diskless
It is also possible to mix diskless operation, as described above, with certain parts of the filesystem on physical disk. In this example all data in /local will be on the physical disk, the rest in memory. Note that when nodes operate
een created, all nodes must be assigned an IPMI interface on this network. The easiest method of doing this is to create the interface for one node device, and then to clone that device several times. For larger clusters this can be laborious, and a simple bash loop can be used to do the job instead:

[bright51 ~]# for ((i=1; i<=150; i++)); do
> echo "
> device interfaces node$(printf '%03d' $i)
> add ipmi ipmi0
> set network ipminet
> set ip 10.148.0.$i
> commit"
> done | cmsh -x    # -x usefully echoes what is piped into cmsh

The preceding loop can conveniently be replaced with the addinterface command, run from within the device mode of cmsh:

[bright51 ~]# echo "device
> addinterface -n node001..node150 ipmi ipmi0 ipminet 10.148.0.1
> commit" | cmsh -x

The help text for addinterface gives more details on how to use it.

In order to be able to communicate with the IPMI interfaces, the head node also needs an interface on the IPMI network. Depending on how the IPMI interfaces are physically connected to the head node, the head node has to be assigned an IP address on the IPMI network one way or another. There are two possibilities for how the IPMI interfaces are physically connected:

• When the IPMI interfaces are connected to the primary internal network, the head node should be assigned an alias interface, configured with an IP address on t
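Before piping such a loop into cmsh, the generated command stream can be inspected on its own. The sketch below only prints the commands the loop from the text would produce (for three nodes instead of 150, for brevity); it does not run cmsh:

```bash
#!/bin/bash
# Print the cmsh command stream that the IPMI interface loop generates,
# without piping it into cmsh, so it can be checked first.
for ((i=1; i<=3; i++)); do
  printf 'device interfaces node%03d\n' "$i"
  echo "add ipmi ipmi0"
  echo "set network ipminet"
  echo "set ip 10.148.0.$i"
  echo "commit"
done
```

Once the output looks right, appending `| cmsh -x` executes it as in the text.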
efault: EventBucketFilter /cm/local/apps/cmd/etc/eventbucket.filter

The EventBucketFilter directive specifies the path to the file that contains the regular expressions used to filter out incoming messages on the event bucket.

LDAPHost directive
Syntax: LDAPHost hostname
Default: LDAPHost localhost

The LDAPHost directive specifies the hostname of the LDAP server to connect to for user management.

LDAPUser directive
Syntax: LDAPUser username
Default: LDAPUser root

The LDAPUser directive specifies the username that is used when connecting to the LDAP server.

LDAPPass directive
Syntax: LDAPPass password
Default: LDAPPass <random string set during installation>

The LDAPPass directive specifies the password that is used when connecting to the LDAP server.

LDAPSearchDN directive
Syntax: LDAPSearchDN dn
Default: LDAPSearchDN dc=cm,dc=cluster

The LDAPSearchDN directive specifies the Distinguished Name (DN) that is used when querying the LDAP server.

DocumentRoot directive
Syntax: DocumentRoot path
Default: DocumentRoot /cm/local/apps/cmd/etc/htdocs

The DocumentRoot directive specifies the directory that is mapped to the web root of the CMDaemon. The CMDaemon acts as an HTTP server, and can therefore in principle also be accessed by web browsers.

SpoolDir directive
Syntax: SpoolDir path
Default: SpoolDir
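Put together, a fragment of a cmd.conf using several of these directives might look as follows. This is an illustrative sketch only; the actual cmd.conf on the head node should be consulted for the exact quoting convention used there:

```
# Illustrative cmd.conf fragment (values and quoting are examples only)
EventBucketFilter = "/cm/local/apps/cmd/etc/eventbucket.filter"
LDAPHost = "localhost"
LDAPUser = "root"
LDAPSearchDN = "dc=cm,dc=cluster"
DocumentRoot = "/cm/local/apps/cmd/etc/htdocs"
```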
efault: 200) used to draw the graph. For example, although there may be 2000 data points available during the selected period, by default only 200 are used, with each of the 200 being an average of 10 real data points. This mechanism is especially useful for smoothing out noisier metrics, to give a better overview of behavior.

• The Refresh Rate, which sets how often the graph is recreated.

• The visual layout of the graphs, which can be adjusted so that:

  – Color aspects of each graph are changed, in the row of settings for that graph.

  – Each graph is deleted from its pane, with the button at the end of the row of settings for that graph.

10.4 Monitoring Configuration With Cmgui
This section is about the configuration of monitoring for health checks and metrics, along with setting up the actions which are triggered from a health check or a metric threshold check.

Selecting Monitoring Configuration from the resources section of cmgui makes the following tabs available (figure 10.15; Overview displays as the default):

• Metric Configuration

• Health Check Configuration

• Metrics
efreshes the database from the external LDAP. Here the database is updated once a day.

• The credentials argument specifies the password chosen for the syncuser on the external LDAP server.

More on the syncrepl directive can be found in the OpenLDAP documentation (http://www.openldap.org/doc/).

The configuration files must also be edited so that:

• The <suffix> and rootdn settings in slapd.conf both use the correct <suffix> value, as used by the provider.

• The <base> value in /etc/ldap.conf uses the correct <suffix> value, as used by the provider. This is set on all Bright cluster nodes.

Finally, before replication takes place, the consumer database is cleared. This can be done by removing all files, except for the DB_CONFIG file, from under the configured database directory, which by default is at /var/lib/ldap/.

The consumer is restarted using:

service ldap restart

This replicates the provider's LDAP database, and continues to do so at the specified intervals.

7.3.2 High Availability
No External LDAP Server Case
If the LDAP server is not external, that is, if the Bright Cluster Manager is set to its high-availability configuration with its LDAP servers running internally on its own head nodes, then by default LDAP services are provided from both the active and the passive node. The high-availability setting ensures that CMDaemon t
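For reference, a consumer-side syncrepl stanza in slapd.conf along the lines described above might look as follows. The provider hostname and password are placeholders, and the OpenLDAP documentation has the authoritative syntax:

```
# Illustrative consumer-side replication stanza for slapd.conf
syncrepl rid=001
         provider=ldap://ldapserver.example.com
         type=refreshOnly
         interval=01:00:00:00          # refresh once a day (dd:hh:mm:ss)
         searchbase="dc=cm,dc=cluster" # must match the provider's <suffix>
         bindmethod=simple
         binddn="cn=syncuser,dc=cm,dc=cluster"
         credentials=<syncuser password>
```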
main ............................. Enter main mode
monitoring ....................... Enter monitoring mode
network .......................... Enter network mode
nodegroup ........................ Enter nodegroup mode
partition ........................ Enter partition mode
process .......................... Enter process mode
profile .......................... Enter profile mode
session .......................... Enter session mode
softwareimage .................... Enter softwareimage mode
user ............................. Enter user mode

Figure 3.9: Top level commands in cmsh

All levels inside cmsh provide these top-level commands.

Passing a command as an argument to help gets details for it:

Example

[myheadnode]% help run
Usage: run [-echo|-x] [-quit|-q] <filename> [<filename2> ...]
Execute all commands in the given file(s)

-echo|-x ......................... Echo all commands
-quit|-q ......................... Exit immediately after error
[myheadnode]%

In the general case, invoking help at any level without an argument provides the list of top-level commands, followed by commands that may be used at that level (list of top-level commands elided in example below):

Example

[myheadnode]% session
[myheadnode->session]% help
================================ Top ================================
...
============================== session ==============================
id ............................... Disp
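Since the run command executes commands from a file, a session might first prepare such a file in the shell. The filename and contents below are illustrative; inside cmsh the file would then be executed with, for example, run -x /tmp/commands.cmsh:

```bash
#!/bin/bash
# Prepare a file of cmsh commands for later execution with cmsh's "run"
# command (filename is illustrative).
cat > /tmp/commands.cmsh <<'EOF'
device
list
EOF
cat /tmp/commands.cmsh
```

With the -x option, run echoes each command from the file as it is executed.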
eir associated role after they are created. The creation of queues is described in sections 8.5.2 (using cmgui) and 8.6.2 (using cmsh).

If the role for the individual non-head node is set and saved, then it overrides its corresponding category role. In cmgui this is done by selecting the particular node device from the default Slave Nodes folder, then selecting the Roles tab. The appropriate workload manager client role is then configured (figure 8.3).

Figure 8.3: Workload Management Role Assignment For An Individual Node

A useful feature of cmgui is that the role displayed for the individual node can be toggled between the category setting and the individual setting, by clicking on the role checkbox (figure 8.4). Clicking on the Save button of the tabbed pane saves the displayed setting.
elected running jobs.

• The Resume button allows selected suspended jobs to run again.

• The Refresh button refreshes the screen, so that the latest available jobs list is displayed.

8.5.2 Queues Display And Handling In cmgui
Selecting the Queues tab displays a list of queues available, their associated scheduler, and the list of nodes that use each queue (figure 8.6).

Figure 8.6: Workload Manager Queues

Within the tabbed pane:

• The Edit button allows an existing job queue of a workload manager to be edited. The particular values that can be edited for the queue depend upon the workload manager used (figures 8.7 and 8.8).
emaining buttons (Edit, Add, Thresholds, and Consolidators) open up options dialogs. These options are now discussed.

Metric Configuration: The Main Tab's Edit And Add Options
The Metric Configuration tab of figure 10.16 has Add and Edit buttons. The Add button opens up a dialog to add a new metric to the list, and the Edit button opens up a dialog to edit a selected metric from the list.

The dialogs allow logging options for a metric to be set or adjusted. For example, a new metric could be set for sampling by adding it to the device category from the available list of all metrics; or the sampling frequency could be changed on an existing metric; or an action could be set for a metric that has a tendency to flap.

The Edit and Add dialogs for a metric have the following options (figure 10.17):

Figure 10.17: cmgui Monitoring: Metric Configuration Edit Dialog

• Metric: The name of the metric.

• Parameter: Values that the metric script is designed to handle. For example: the metric FreeSpace tracks the free space left in a filesystem, and is given a mount point such as / or /var as a parameter; the metric BytesR
er. After package installation, SGE software components are installed in /cm/shared/apps/sge/current, also referred to as SGE_ROOT.

SGE documentation is available via man pages, via documentation in the directory /cm/shared/docs/sge, as well as at the SGE website at http://wikis.sun.com/display/sungridengine/Home.

Configuring SGE
After installation and initialization, SGE runs with reasonable defaults. Administrators familiar with SGE can reconfigure it using the template files in SGE_ROOT/cm/templates, which define the queues, host groups, and parallel environments.

To configure the head node for use with SGE, the install_qmaster wrapper script under SGE_ROOT is run. To configure a node image for use with SGE, the install_execd wrapper script under SGE_ROOT is run.

Running SGE
After initialization is carried out as described in the preceding text, SGE can be enabled and disabled as described in sections 8.3.1 and 8.3.2. The SGE workload manager runs the following two daemons:

1. an sge_qmaster daemon running on the head node. This handles queue submissions and schedules them according to criteria set by the administrator.

2. an sge_execd execution daemon running on each compute node. This accepts, manages, and returns the results of the jobs on the compute nodes.

SGE maintains several log files in /cm/shared/apps/sge/current/default/spool/. Messages from the qmaster daemon are logged to /cm/shared/apps/sge/current/default/spool/messages
er sequence to make the current machine active.

Example: Usage information

[root@mycluster1 ~]# cmha
Usage: /cm/local/apps/cmd/sbin/cmha [ status | makeactive | dbreclone <node> ]

Example: To display failover status information

[root@mycluster1 ~]# cmha status
Node Status: running in active master mode

Failover status:
mycluster1* -> mycluster2
  backupping  [  OK  ]
  mysql       [  OK  ]
  ping        [  OK  ]
  status      [  OK  ]
mycluster2 -> mycluster1*
  backupping  [  OK  ]
  mysql       [  OK  ]
  ping        [  OK  ]
  status      [  OK  ]

The * in the output indicates the head node which is currently active. The status output shows 4 aspects of the HA subsystem, from the perspective of both head nodes:

HA Status    Description
backupping   the other head node is visible over the dedicated failover network
mysql        MySQL replication status
ping         the other head node is visible over the primary management network
status       CMDaemon running on the other head node responds to SOAP calls

Example: To initiate a failover manually

[root@mycluster2 ~]# cmha makeactive
Proceeding will initiate a failover sequence which will make this node
(mycluster2) the active master.
Are you sure ? [Y/N]
y
Your session ended because: CMDaemon failover, no longer master
mycluster2 became active master, reconnecting your cmsh ...

13.3.2 States
The state a head node is in can be determined in three dif
erberos Client
Assuming the Kerberos server is a different server from the LDAP server, then the LDAP server on the head node should be configured as a Kerberos client. The changes are implemented as follows:

Configuring The LDAP Server As A Kerberos Client: LDAP Server Changes
The /etc/krb5.conf file is copied from the Kerberos server onto the Bright head node. On the Kerberos server, the kadmin shell is entered, and the LDAP server is created as a principal:

Example

addprinc -randkey host/master.cm.cluster

Here, master.cm.cluster should match the fully qualified domain name of the LDAP server. The kadmin shell is exited using the exit command.

On the LDAP server, the kadmin shell is entered, and the principal is added to the keytab:

Example

ktadd host/master.cm.cluster

As with the addprinc command, master.cm.cluster should correspond to the LDAP server's fully qualified domain name.

Configuring The LDAP Server As A Kerberos Client: Node Changes
The procedure in the previous section is repeated for all nodes in the cluster. The easiest way is to modify the image under /cm/images. The file /etc/krb5.conf is copied to /cm/images/<image>/etc/krb5.conf. For each node, the following command is issued on the Kerberos server, using the kadmin shell:

Example

addprinc -randkey host/<nodenumber>.cm.cluster

where <nodenumber>.cm.cluster represents the node hostname, with <nodenumber> typically taking values
ere netmaskbits is 25 bits in size, and only the last 7 bits are zeroed. In dotted-quad notation this implies "128" as the last quad value, i.e. zeroing the last 7 bits. For example: 192.168.3.128.

When in doubt, or if the preceding terminology is not understood, then the values to use can be calculated using the head node's sipcalc utility. To use it, the IP address in CIDR format for the head node must be known.

When run using a CIDR address value of 192.168.3.130/25, the output is (some output removed for clarity):

$ sipcalc 192.168.3.130/25
Host address            - 192.168.3.130
Network address         - 192.168.3.128
Network mask            - 255.255.255.128
Network mask (bits)     - 25
Broadcast address       - 192.168.3.255
Addresses in network    - 128
Network range           - 192.168.3.128 - 192.168.3.255

Running it with the -b (binary) option may aid comprehension:

$ sipcalc -b 192.168.3.130/25
Host address            - 11000000.10101000.00000011.10000010
Network address         - 11000000.10101000.00000011.10000000
Network mask            - 11111111.11111111.11111111.10000000
Broadcast address       - 11000000.10101000.00000011.11111111
Network range           - 11000000.10101000.00000011.10000000 -
                          11000000.10101000.00000011.11111111
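The arithmetic that sipcalc performs here can also be sketched in a few lines of bash: the host address is masked with the prefix-length netmask to get the network address, and the inverted mask is ORed in to get the broadcast address. The cidr_info helper below is illustrative and not part of Bright Cluster Manager:

```bash
#!/bin/bash
# Compute network and broadcast addresses from an IP address and a
# prefix length, mirroring sipcalc's calculation.
cidr_info() {
  local ip=$1 bits=$2 a b c d
  IFS=. read -r a b c d <<< "$ip"
  local addr=$(( (a<<24) | (b<<16) | (c<<8) | d ))
  local mask=$(( (0xFFFFFFFF << (32-bits)) & 0xFFFFFFFF ))
  local net=$(( addr & mask ))                 # zero the host bits
  local bcast=$(( net | (~mask & 0xFFFFFFFF) )) # set the host bits
  printf 'network=%d.%d.%d.%d broadcast=%d.%d.%d.%d\n' \
    $((net>>24&255)) $((net>>16&255)) $((net>>8&255)) $((net&255)) \
    $((bcast>>24&255)) $((bcast>>16&255)) $((bcast>>8&255)) $((bcast&255))
}

cidr_info 192.168.3.130 25
# -> network=192.168.3.128 broadcast=192.168.3.255
```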
ere no such system in place. This is because, without resource management, there is a tendency for each individual user to overexploit common resources.

When a workload manager is used, the user submits a batch (i.e. non-interactive) job to it. The workload manager assigns resources to the job, and checks the current availability, as well as its estimates of the future availability, of the cluster resources that the job is asking for. The workload manager schedules and executes the job based on the assignment criteria that the administrator has set for the workload management system. After the job has finished executing, the job output is delivered back to the user.

The details of job submission from a user's perspective are covered in the User Manual.

Installing and setting up these choices is covered in this chapter in sections 8.1–8.4. How cmgui and cmsh are used to view and handle jobs, queues and node drainage is then covered in sections 8.5–8.6. The chapter finishes by giving various examples of how the workload manager can be used in Bright Cluster Manager, in section 8.7.

8.1 Workload Managers Choices And Installation
During cluster installation, a workload manager can be chosen (figure 2.17) for setting up. The choices are:

• None

• Sun Grid Engine (SGE). This is the default.

• Torque (v2.4.8) and its built-in scheduler

• Torque (v2.4.8) and the Maui scheduler

• Torque (v2.4.8) and the Moab scheduler

• PBS Pro

Installation a
ernel, then the Lustre kernel needs to be installed with the --force option. "Opening /sys/block" and GRUB error messages can be ignored.

Example

[root@mycluster ~]# rpm --root /cm/images/lustre-server-image -e e4fsprogs
[root@mycluster ~]# rpm --root /cm/images/lustre-server-image -Uvh e2fsprogs-1.41.10.sun2-0redhat.rhel5.x86_64.rpm
[root@mycluster ~]# rpm --root /cm/images/lustre-server-image --force -ivh kernel-2.6.18-164.11.1.el5_lustre.2.0.0.1.x86_64.rpm
[root@mycluster ~]# rpm --root /cm/images/lustre-server-image -ivh lustre-ldiskfs-3.2.0-2.6.18_164.11.1.el5_lustre.2.0.0.1.x86_64.rpm
[root@mycluster ~]# rpm --root /cm/images/lustre-server-image -ivh lustre-2.0.0.1-2.6.18_164.11.1.el5_lustre.2.0.0.1.x86_64.rpm lustre-modules-2.0.0.1-2.6.18_164.11.1.el5_lustre.2.0.0.1.x86_64.rpm

The kernel version is set to the Lustre kernel version for the Lustre server image:

Example

[root@mycluster ~]# cd /cm/images/lustre-server-image/boot
[root@mycluster boot]# ls -1 vmlinuz-*
vmlinuz-2.6.18-164.11.1.el5_lustre.2.0.0.1
vmlinuz-2.6.18-194.17.1.el5
[root@mycluster boot]# cmsh
[mycluster]% softwareimage
[mycluster->softwareimage]% use lustre-server-image
[mycluster->softwareimage[lustre-server-image]]% set kernelversion 2.6.18-164.11.1.el5_lustre.2.0.0.1
[mycluster->softwareimage[lustre-server-image]]% commit

Creating The Lustre Server Category
A node category is clon
ers.

H.2.1 Health Checks
Table H.2.1: List Of Health Checks

Name                    Query (response is PASS/FAIL/UNKNOWN)
----------------------  -------------------------------------------------------
DeviceIsUp              Is the device up, closed or installing?
ManagedServicesOk       Are CMDaemon-monitored services all OK?
cmsh                    Is cmsh available?
exports                 Are all filesystems as defined by the cluster
                        management system exported?
failedprejob            Are there failed prejob health checks?
                        (here "yes" = FAIL)
failover                Is all well with the failover system?
ldap                    Can the ID of the user be looked up with LDAP?
mounts                  Are all mounts defined in the cluster manager OK?
mysql                   Is the status and configuration of MySQL correct?
node-hardware-profile   Is the specified node's hardware configuration
                        during health check use unchanged? The options to
                        this script are described using the -h|--help option.
                        Before this script is used for health checks, the
                        specified hardware profile is usually first saved
                        with the -s option, e.g.:
                        node-hardware-profile -n node001 -s hardwarenode001
portchecker             Is the specified port on the specified host open
                        for TCP (default) or UDP connections?
rogueprocess            Are the user processes that are running legitimate
                        (i.e. not rogue)?
ssh2node                Is passwordless ssh login from head to node working?
testhealthcheck         A health check script example, for creating scripts,
                        or setting a mix of PASS/FAIL/UNKNOWN responses.
                        The source includes examples of environment v
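A custom health check in the style of testhealthcheck simply prints PASS, FAIL, or UNKNOWN on standard output. The following sketch is illustrative only (it is not one of the shipped health checks): it reports FAIL when the root filesystem is more than 90% full, and UNKNOWN when the usage cannot be determined:

```bash
#!/bin/bash
# Illustrative health check: respond PASS/FAIL/UNKNOWN on stdout.
# Check (hypothetical): is the root filesystem more than 90% full?
usage=$(df -P / | awk 'NR==2 {gsub(/%/,""); print $5}')
if [ -z "$usage" ]; then
  echo UNKNOWN          # could not determine usage
elif [ "$usage" -gt 90 ]; then
  echo FAIL             # filesystem nearly full
else
  echo PASS
fi
```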
ers in the new category to suit the new machine (figure 8.11).

Figure 8.11: Cloning A New Category Via cmgui

Having cloned and saved the category, called gpunodes in the example of figure 8.11, the configuration of the category may be altered to suit the new machine, perhaps by going into the settings tab and altering items there.

Next, the queue is set for this new category, gpunodes, by going into the Roles tabbed pane of that category, selecting the appropriate workload manager client role and queues, and saving the setting (figure 8.12).

Figure 8.12: Setting A Queue For A New Category Via cmgui

Finally, a node in the Slave Nodes folder that is to be placed in the new gpunodes cate
ersion    OpenCL 1.0 CUDA 3.2.1, SDK Revision = 7027912

NumDevs = 2, Device = Tesla T10 Processor, Device = Tesla T10 Processor

System Info:
Local Time/Date = 19:04:20, 01/13/2011
CPU Name: Intel(R) Xeon(R) CPU 5130 @ 2.00GHz
# of CPU processors: 4
Linux version 2.6.18-194.26.1.el5 (mockbuild@builder10.centos.org) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-48)) #1 SMP Tue Nov 9 12:54:20 EST 2010

PASSED
212. ervicesok rogueprocess myheadnode gt monitoring gt healthchecks show The get command returns the value of an individual parameter of a particular health check object Example myheadnode gt monitoring gt healthchecks get deviceisup description Returns PASS when device is up closed or installing myheadnode gt monitoring gt healthchecks cmsh Monitoring healthchecks add use remove commit refresh mod ified set clear and validate The remaining commands in monitoring healthchecks mode add use remove commit refresh modified set clear and validate all work as outlined in the introduction to working with objects section 3 6 3 More detailed usage examples of these commands within a monitoring mode are given in Cmsh Monitoring Actions section 10 7 1 In the basic example of section 10 1 a metric script was set up from cmgui to check if thresholds were exceeded and if so to launch an action A functionally equivalent task can be set up by creating and config uring a health check because metrics and health checks are so similar in concept This is done here to illustrate how cmsh can be used to do something similar to what was done with cmgui in the basic example A start is made on the task by creating a health check object and setting its values using the monitoring healthchecks mode of cmsh The task is completed in the section on the monitoring setup mode in section 10 7 4 To start the task cmsh s ad
213. es currently being provisioned lt none gt The cmgui equivalent is accessed from the Provisioning Status tabbed pane in the Software Images resource Figure 6 3 tracking the provisioning log changes synclog For a closer look into the image file changes carried out during provision ing requests the synclog command from device mode can be used lines elided in the following output Example bright51 gt device synclog node001 Tue 11 Jan 2011 13 27 17 CET Starting rsync daemon based provisionin g Mode is SYNC sending incremental file list deleting var lib ntp etc localtime var lib ntp var run ntp sent 2258383 bytes received 6989 bytes 156232 55 bytes sec total size is 1797091769 speedup is 793 29 Tue 11 Jan 2011 13 27 31 CET Rsync completed In cmgui the equivalent output to cmsh s synclog is displayed by selecting a specific device or a specific category from the resource tree Then within the tasks tabbed pane that opens up the Provisioning Log button at the bottom right is clicked Figure 6 18 Bright Computing Inc 6 3 Node Installer 105 Eile Monitoring View Help RESOURCES Slave Nodes E Bright 5 1 Cluster vis My Clusters Overview Status Tasks Network Setup Node Identification Wizard vE Bright 5 1 Cluster gt G Networks gt E Power Distribution Units gt i Software Images Operating System Shutdown gt E Node Categories b Head
214. esponds to the cmgui metrics tab of section 10 4 4 The monitoring metrics mode of cmsh handles metrics objects in the way described in the introduction to working with objects section 3 6 3 A typical reason to handle metrics objects the properties associated with a metrics script or metrics built in might be to view the configuration metrics already being used for sampling by a device category or to add a metric for use by a device category This section goes through a cmsh session giving some examples of how this mode is used and to illustrate its behavior cmsh monitoring metrics list show and get In metrics mode the list command by default lists the names and com mand scripts available for setting for device categories Example root myheadnode cmsh myheadnode monitoring metrics myheadnode gt monitoring gt metrics list Name key Command AlertLevel lt built in gt AvgExpFactor lt built in gt Bright Computing Inc 10 7 Monitoring Modes With Cmsh AvgJobDuration lt built in gt The above shows a truncated list of the metrics that may be used for sampling on a newly installed system What these metrics do is described in appendix H 1 1 The show command of cmsh displays the parameters and values of a specified metric Example myheadnode gt monitoring gt metrics show cpuuser Parameter Value Class of metric cpu Command lt built in gt Cumulative yes Description Total
215. ests done at the time of install it uses an exclude list as detailed in Section 6 3 7 This exclude list is defined in the excludelistupdate property of the node s category The main difference between excludelistupdate from this section and excludelistsyncinstall excludelistfullinstall from Section 6 3 7 is in dicated by the parts of their names emphasized here Namely the excludelistupdate property settings concern an update to a running system while the other two are about an install during node start up The syntax for the exclude list in the update case remains the same as that of the install cases i e defined by the syntax used by rsync s exclude from option A sample cmsh one liner which opens up a text editor in a category to set the exclude list for updates for is cmsh c category use slave set excludelistupdate commit The exclude list can be edited in cmgui as indicated in Figure 6 19 File Monitoring View Help RESOURCES 4 slave E Bright 5 1 cluster 7255 My Clusters vE Bright 5 1 Cluster gt E Switches Exclude list update b E Networks f gt E Power Distribution Units f A saii b GJ Software Images 7 v jNode Categories dev 8 b CI Head Nodes Browse b Chassis v j Slave Nodes 5 E a Ready Figure 6 19 Setting up exclude lists with cmgui for node updates In addition to the paths excluded using the excludelistupdate prop erty the provisioning system automati
216. et to the MAC address of the node that means the MAC value is passed on to 3 Using this technique the power operation ON can then carry out a Wake On LAN operation on the node from the head node Setting the custompowerscriptargument can be done like this for all nodes bin bash for nodename in cmsh c device foreach get hostname do macad cmsh c device use nodename get mac cmsh c device use nodename set customscriptargument macad commit done Bright Computing Inc 5 1 Configuring Power Parameters 77 The preceding material usefully illustrates how custompowerscriptargument can be used to pass on arbitrary pa rameters for execution to a custom script However it turns out that the goal of the task can be simplified consid erably and achieved quicker by using the environment variables available in the cluster management daemon environment for this example How to do this is examined in the next section Using Environment Variables With custompowerscript Simplification of the steps needed for custom scripts in CMDaemon is often possible because there are values in the CMDaemon environment already available to the script A line such as env gt tmp env added to the start of a custom script dumps the names and values of the environment variables to tmp env for viewing One of the names is CMD_MAC and it holds the MAC address string of the node being considered So it is not necessary
evel services. From here on, the boot process continues as if the machine was started from the drive, just like any other regular Linux machine.

6.4 Node States

6.4.1 Node States Indicating Regular Start-Up

Throughout the boot process, the node sends several state change messages to the head node CMDaemon. During a successful boot process the node goes through the following states:

• INSTALLING. This state is entered as soon as the node-installer has determined on which node the node-installer is running.

• INSTALLER_CALLINGINIT. This state is entered as soon as the node-installer has handed over control to the local init process.

• UP. This state is entered as soon as the CMDaemon of the node connects to the head node CMDaemon.

These states can be seen in the event viewer pane in cmgui, or in the console within cmsh, with messages indicating the name of the node that is in the Installing, Calling Init, or Up state.

6.4.2 Node States Indicating Problems

Several other node states are used to indicate problems in the boot process:

• INSTALLER_FAILED. This state is entered from the INSTALLING state when the node-installer has detected an unrecoverable problem during the boot process. For instance, it cannot find the local drive, or a network interface could not be started, etc. This state can also be entered from the INSTALLER_CALLINGINIT state when the node takes too long to enter the UP state. This could
ferent ways:

1. By looking at the message being displayed at login time.

Example

2. By executing cmha status.

Example

[root@mycluster ~]# cmha status
Node Status: running in active master mode

3. By examining /var/spool/cmdaemon/state.

There are a number of possible states that a head node can be in:

State                   Description
INIT                    Head node is initializing
FENCING                 Head node is trying to determine whether it should try to become active
ACTIVE                  Head node is in active mode
PASSIVE                 Head node is in passive mode
BECOMEACTIVE            Head node is in the process of becoming active
BECOMEPASSIVE           Head node is in the process of becoming passive
UNABLETOBECOMEACTIVE    Head node tried to become active but failed
ERROR                   Head node is in an error state due to an unknown problem

Especially when developing custom mount and unmount scripts, it is quite possible for a head node to go into the UNABLETOBECOMEACTIVE state. This generally means that the mount and/or unmount script is not working properly, or is returning incorrect exit codes. To debug these situations, it can be helpful to examine the output in /var/log/cmdaemon. The cmha makeactive command can be used to instruct a head node to become active again.

13.3.3 Keeping Head Nodes In Sync

It is important that relevant filesystem changes outside of the shared directories that are made to the active head node are also made on th
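As a small illustration of option 3, a helper like the following could read the state file and flag the failure case described above. The function name and the UNKNOWN fallback are inventions for this sketch; the default path is the one given in the text.

```shell
# check_ha_state: print the head node HA state from the state file and
# return non-zero if the node tried and failed to become active.
check_ha_state() {
    local statefile="${1:-/var/spool/cmdaemon/state}"
    local state
    state=$(cat "$statefile" 2>/dev/null) || state="UNKNOWN"
    echo "$state"
    [ "$state" != "UNABLETOBECOMEACTIVE" ]
}
```

Pointing the function at a copy of the state file makes it easy to test; on failure, /var/log/cmdaemon is the place to look, as noted above.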
ffix>" -W -f syncuser.ldif

This prompts for the root password configured in slapd.conf.

To verify syncuser is in the LDAP database, the output of ldapsearch can be checked:

ldapsearch -x "(sn=syncuser)"

To allow access to the userPassword attribute for syncuser, the following lines in slapd.conf are changed, from:

access to attrs=userPassword
    by self write
    by anonymous auth
    by * none

to:

access to attrs=userPassword
    by self write
    by dn="cn=syncuser,<suffix>" read
    by anonymous auth
    by * none

Provider configuration is now complete, and the server can be restarted using /etc/init.d/ldap restart.

External LDAP Server Replication: Configuring The Consumer(s)

The consumer is an LDAP server on a Bright head node. It is configured to replicate with the provider by adding the following lines to /cm/local/apps/openldap/etc/slapd.conf:

syncrepl rid=2
    provider=ldap://external.ldap.server
    type=refreshOnly
    interval=01:00:00:00
    searchbase=<suffix>
    scope=sub
    schemachecking=off
    binddn="cn=syncuser,<suffix>"
    bindmethod=simple
    credentials=secret

Here:

• The rid=2 value is chosen to avoid conflict with the rid=1 setting used during high-availability configuration (see Section 7.3.2).

• The provider argument points to the external LDAP server.

• The interval argument (format DD:HH:MM:SS) specifies the time interval before the consumer r
flexible and powerful high-level interface for the netfilter packet filtering framework inside the 2.4 and 2.6 Linux kernels. Behind the scenes, Shorewall uses the standard iptables command to configure netfilter in the kernel. All aspects of firewall and gateway configuration are handled through the configuration files located in /etc/shorewall.

Shorewall does not run as a daemon process, but rather exits immediately after configuring netfilter through iptables. After modifying Shorewall configuration files, Shorewall must be run again to have the new configuration take effect:

service shorewall restart

In the default setup, Shorewall provides gateway functionality to the internal cluster network on the first network interface, eth0. This network is known as the nat zone to Shorewall. The external network (i.e. the connection to the outside world) is assumed to be on the second network interface, eth1. This network is known as the net zone in Shorewall. The interfaces file is generated by the cluster management daemon.

Shorewall is configured by default (through /etc/shorewall/policy) to deny all incoming traffic from the net zone, except for the traffic that has been explicitly allowed in /etc/shorewall/rules. Providing a subset of the outside world with access to a service running on a cluster can be accomplished by creating appropriate rules in /etc/shorewall/rules.
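For example, to let one outside subnet reach the SSH daemon on the head node itself, a rule along the following lines could be added to /etc/shorewall/rules. The subnet is a placeholder, and "fw" is assumed to be the standard Shorewall zone name for the firewall host, as defined in /etc/shorewall/zones:

```
#ACTION    SOURCE              DEST    PROTO   DEST PORT(S)
ACCEPT     net:192.0.2.0/24    fw      tcp     22
```

After editing the file, running service shorewall restart applies the change, as described above.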
g the following command as root:

module initadd shared

Other modules can also be loaded automatically with module initadd at login, using the full path specification. More details on the modules environment from an administrator's perspective are given in Section 12.1, or in the manual page, man (1) module.

3.3 Authentication

3.3.1 Changing Administrative Passwords On The Cluster

How to set up or change regular user passwords is not discussed here, but in Chapter 7 on user management.

Amongst the administrative passwords associated with the cluster are:

1. The root password of the head node: this allows a root login to the head node.

2. The root password of the node images: this allows a root login to a regular node, and is stored in the image file.

3. The root password of the node-installer: this allows a root login to the node when the node-installer, a stripped-down operating system, is running. The node-installer stage prepares the node for the final operating system when the node is booting up. Section 6.3 discusses the node-installer in more detail.

4. The root password of MySQL: this allows a root login to the MySQL server.

5. The administrator certificate password: this decrypts the root admin.pfx file so that the administrator certificate can be presented to CMDaemon when administrator tasks require running. Section 3.3.2 discusses certificates in more detail.
ge:

[mycluster->softwareimage]% clone default-image lustre-client-image
[mycluster->softwareimage*[lustre-client-image*]]% commit

The RPM packages can be downloaded from the Lustre website. The same Lustre version which is used for the Lustre servers must be used for the Lustre clients. The RPM packages to download are:

• kernel: Lustre-patched kernel

• kernel-modules: Lustre modules (client and server) for the Lustre-patched kernel

• lustre: Lustre userland tools (client and server) for the Lustre-patched kernel

If the Lustre kernel has a lower version number than the installed kernel, then the Lustre kernel needs to be installed with the --force option. "Opening /sys/block" and GRUB error messages can be ignored.

Example

[root@mycluster ~]# rpm --root /cm/images/lustre-client-image --force -ivh \
kernel-2.6.18-164.11.1.el5_lustre.2.0.0.1.x86_64.rpm
[root@mycluster ~]# rpm --root /cm/images/lustre-client-image -ivh \
lustre-2.0.0.1-2.6.18_164.11.1.el5_lustre.2.0.0.1.x86_64.rpm \
lustre-modules-2.0.0.1-2.6.18_164.11.1.el5_lustre.2.0.0.1.x86_64.rpm

Any ldiskfs warnings can be ignored, since ldiskfs is not used by a Lustre client.

The kernel version used is set for the Lustre image to the Lustre kernel.

Example

[root@mycluster ~]# cd /cm/images/lustre-client-image/boot
[root@mycluster boot]# ls -1 vmlinuz*
vmlinuz-2.6.18-164.11.1.el5_lustre.2.0.0.1
vmlinuz-2.6.18-194.17.1.el5
[root@mycluster boot]# cmsh
[mycluster]% softwareima
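To pick the right value for the image's kernel version, the kernels present inside an image can be listed with a small helper like this sketch (the function name is made up; the image path is just the example used above):

```shell
# list_image_kernels: print the kernel versions that have a vmlinuz
# installed under <image-root>/boot, e.g. to choose the value for
# "set kernelversion" in cmsh.
list_image_kernels() {
    ls "$1/boot" 2>/dev/null | sed -n 's/^vmlinuz-//p'
}
```

Usage: list_image_kernels /cm/images/lustre-client-image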
ge:

[mycluster]% softwareimage use lustre-client-image
[mycluster->softwareimage[lustre-client-image]]% set kernelversion \
2.6.18-164.11.1.el5_lustre.2.0.0.1
[mycluster->softwareimage*[lustre-client-image*]]% commit

Creating The Lustre Client Image: Method 3

This method describes how to create a Lustre client image by building Lustre from source.

As a starting-point image for a Lustre client image, a clone is made of an existing software image, for example from default-image. A clone software image is created via cmgui (Figure 12.1), or using cmsh on the head node:

Example

[root@mycluster ~]# cmsh
[mycluster]% softwareimage
[mycluster->softwareimage]% clone default-image lustre-client-image
[mycluster->softwareimage*[lustre-client-image*]]% commit

The source package can be downloaded from the Lustre website. The same Lustre version used for Lustre servers is used for the Lustre clients. Instead of selecting a Linux distribution and architecture, a source package to download is chosen:

• lustre-<version>.tar.gz: Lustre source code

The source file is copied to the image:

Example

[root@mycluster ~]# cp lustre-2.0.0.1.tar.gz /cm/images/lustre-client-image/usr/src

If the kernel-devel package is not installed on the client image, it is first installed, so that the kernel can be compiled:

root mycluster devel root mycluster r
ge to the node, so that contents on the local drive persist from the previous boot. If, however, the partition or filesystem does not match the stored configuration, a FULL image sync is triggered.

Install-mode's execution modes

Execution of an install-mode setting is possible in several ways, either permanently or just temporarily for the next boot. Execution can be set to apply to categories or individual nodes. The node-installer looks for install-mode execution settings in this order:

1. The "New node installmode" property of the node's category. This decides the install mode for a node that is detected to be new. It can be set using cmgui (Figure 6.13):

Figure 6.13: cmgui Install Mode Settings Under Node Category

or using cmsh with a one-liner like:

cmsh -c "category use slave; set newnodeinstallmode FULL; commit"

By default, the "New node installmode" property is set to FULL.

2. The Install-mode setting as set by choosing a PXE menu option on t
Figure 5.5: PDU Tasks

Retrieving power status information for a group of nodes:

Example

[mycluster->device]% power -g mygroup status
apc01:3 ............. [ ON  ] node003
apc01:4 ............. [ OFF ] node004

Figure 5.6 shows usage information for the power command:

power [-b|--background] [-d|--delay SECONDS] POWERCOMMAND
power -c|--category CATEGORY [-b|--background] [-d|--delay SECONDS] POWERCOMMAND
power -g|--group GROUP [-b|--background] [-d|--delay SECONDS] POWERCOMMAND
power -h|--chassis CHASSIS [-b|--background] [-d|--delay SECONDS] POWERCOMMAND
power -n|--nodes NODES [-b|--background] [-d|--delay SECONDS] POWERCOMMAND
power -p|--pduport PDUPORT [-b|--background] [-d|--delay SECONDS] POWERCOMMAND

POWERCOMMAND is one of:
  on      Turn power on for current device
  off     Turn power off for current device
  reset   Reset power for current device
  status  Retrieve power status of the current device, or all if none is selected
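Combining these flags, powering on every node in a category in the background, with a delay between nodes to avoid a power surge, might look like this hypothetical session (category name and delay are examples):

```
[mycluster->device]% power -c slave -b -d 2 on
```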
gement system in the first place. Section 11.2 describes how to configure the cluster to disallow user logins to the nodes.

8.3 Enabling, Disabling, And Monitoring Workload Managers

After the corresponding workload manager package is installed and initialized, a workload manager can be enabled or disabled by the administrator with cmgui or cmsh. In Bright Cluster Manager 5.1, SGE can even run concurrently with Torque or with PBS Pro.

For ease of use, the administrator can arrange it so that the skeleton file /etc/skel/.bashrc loads only the appropriate workload manager environment module (sge, torque, or pbspro) as the preferred system-wide default for a category of users. Alternatively, users can adjust their personal .bashrc files.

From the cmgui or cmsh point of view, a workload manager consists of:

• a workload manager server, usually on the head node

• workload manager clients, usually on the compute nodes

Enabling or disabling the servers or clients is then simply a matter of assigning or unassigning the role.

8.3.1 Enabling And Disabling Workload Managers From cmgui

The workload manager server is typically enabled from the head node after the corresponding workload manager package is installed and initialized, as described in the installation section for each workload manager in this chapter. Enabling the server is done in cmgui by clicking on the
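For instance, to make Torque the system-wide default for newly created users, a line such as the following could be appended to /etc/skel/.bashrc (the module name is one of those listed above; existing users would add it to their own .bashrc instead):

```
module load torque
```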
gory must be placed there by changing the category value of that node in the Settings tab (Figure 8.13).

Figure 8.13: Setting A New Category To A Slave Node Via cmgui

8.7.2 Setting Up A Prejob Health Check

How It Works

Health checks (Section 10.2.4) by default run as scheduled tasks over regular intervals. They can optionally be configured to run as prejob health checks, that is, before a job is run. If the response to a prejob health check is PASS, then it shows that the node is displaying healthy behavior for that particular health aspect.

If the response to a prejob health check is FAIL, then it implies that the node is unhealthy, at least for that aspect. A consequence of this may be that a job submitted to the node may fail, or may not even be able to start. To disallow passing a job to such unhealthy nodes is therefore a good policy, and so for a cluster in the default configuration, the action (Section 10.2.2) taken defaults to putting the node in a Drained state (Sections 8.5.3 and 8.6.3), with Bright
gui

Selecting Users & Groups from the Resources tree within cmgui (see Figure 7.1) will, by default, list the LDAP object entries for regular users. These entries are clickable and can be managed further.

By default, there will already be one user on a newly installed system: cmsupport. This is used to run various diagnostics utilities in Bright Cluster Manager and should not be modified.

Figure 7.1: cmgui User Management

The following five buttons are available to manipulate the entries in the Users & Groups resource pane:

1. Add: allows users to be added via a dialog. These additions can be committed via the Save button.

2. Save: saves the as-yet-uncommitted Add or Edit operations. When saving an addition:

• User and group ID numbers are automatically assigned according to the policy of the underlying Linux distribution.
h.img
[root@mycluster ~]# chmod 644 /cm/images/default-image/boot/bios/flash.img

An entry is added to the PXE boot menu to allow the DOS image to be selected. This can easily be achieved by modifying the contents of /cm/images/default-image/boot/bios/menu.conf, which is by default included automatically in the PXE menu. By default, one entry is included in the PXE menu (example below), which is, however, invisible as a result of the MENU HIDE option. Removing the MENU HIDE line will make the BIOS flash option selectable. Optionally, the LABEL and MENU LABEL may be set to an appropriate description.

The option MENU DEFAULT may be added to make the BIOS flash image the default boot option. This is convenient when flashing the BIOS of many nodes.

Example

LABEL FLASHBIOS
  KERNEL memdisk
  APPEND initrd=bios/flash.img
  MENU LABEL Flash BIOS
  MENU HIDE
  MENU DEFAULT

The bios/menu.conf file may contain multiple entries corresponding to several DOS images, to allow for flashing of multiple BIOS versions or configurations.

11.6 Hardware Match Check

Often a large number of identical nodes may be added to a cluster. In such a case it is a good practice to check that the hardware matches what is expected. This can be done easily, as follows:

1. The new nodes, say node129 to node255, are committed to a newly created category newnodes as follows (output truncated):

[root@bright51 ~]# cmsh -c "category add new
h checks. Properties of health checks related to evaluating these states can then be configured from this tab for the selected device category. These properties are the configuration of the state evaluation parameters themselves (for example, frequency and length of logs), but also the configuration of related properties, such as severity levels based on the evaluated state, the actions to launch based on the evaluated state, or the action to launch if the evaluated state is flapping.

The Health Check Configuration tab is initially a blank tab, until the device category is selected by using the Health Check Configuration selection box. The selection box selects a device category from a list of built-in categories and user-defined node categories (node categories are introduced in Section 3.1.3). On selection, the health checks of the selected device category are listed in the Health Check Configuration tab. Properties of the health checks related to the evaluation of states are only available for configuration and manipulation after the health checks list is displayed.

Handling health checks in this manner, via groups of devices, is slightly awkward for just a few machines, but for larger clusters it keeps administration scalable and thus manageable.

Figure 10.22 shows an example of the Health Check Configuration tab after "All master nodes" is chosen as the category. Examples of other categories that could be chosen to have health checks ca
have the capability to run the cluster management daemon at present, and so cannot itself pass on data values directly when cmsh or cmgui need them.

• samplingonslave (default): the non-head node samples the metric itself.

State flapping count: how many times the metric value must cross a threshold within the last 12 samples before it is decided that it is in a flapping state. The default value is 7.

Timeout: after how many seconds the command will give up retrying. The default value is 5 seconds.

Valid for: which device category the metric can be used with, the choices being:

• Slave Node (default)
• Master Node (also a default)
• Power Distribution Unit
• Myrinet Switch
• Ethernet Switch
• IB Switch
• Rack Switch
• Generic Switch
• Chassis
• GPU Unit

Maximum: the default minimum value the y-axis maximum will take in graphs plotted in cmgui. [1]

Minimum: the default maximum value the y-axis minimum will take in graphs plotted in cmgui.

[1] To clarify the concept: if maximum = 3, minimum = 0, then a data point with a y-value of 2 is plotted on a graph with the y-axis spanning from 0 to 3. However, if the data point has a y-value of 4 instead, then it means the default y-axis maximum of 3 is resized to 4, and the y-axis will now span from 0 to 4.

H.2 Health Checks And Their Parameters
he IPMI network.

• When the IPMI interfaces are connected to a dedicated physical network, the head node must also be physically connected to this network. A physical interface must be added and configured with an IP address on the IPMI network.

Example

Assigning an IP address on the IPMI network to the head node, using an alias interface:

[mc]% device
[mc->device]% interfaces
[mc->device->interfaces]% add alias eth0:0
[mc->device->interfaces[eth0:0]]% set network ipminet
[mc->device->interfaces[eth0:0]]% set ip 10.148.255.254
[mc->device->interfaces[eth0:0]]% commit
Mon Dec  6 05:45:05 2010 [mc]: Reboot required: Interfaces have been modified
[mc->device->interfaces[eth0:0]]% quit
[root@mc ~]# /etc/init.d/network restart

As with any change to the network setup, the head node needs to be restarted to make the above change active, although in this particular case restarting the network service would suffice.

4.3.2 IPMI Authentication

The node-installer, described in Chapter 6, is responsible for the initialization and configuration of the IPMI interface of a device. In addition to a number of network-related settings, the node-installer also configures IPMI authentication credentials. By default, IPMI interfaces are configured with username ADMIN and a random password that was generated during the installation of the head node.

Changing the IPMI authentica
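Once credentials are configured, IPMI connectivity to a node's interface can be spot-checked from the head node with a standard ipmitool query; the IP address and password below are placeholders, not values from this manual:

```
ipmitool -I lanplus -H 10.148.0.1 -U ADMIN -P '<password>' chassis status
```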
he Lustre website:

• lustre-client: Lustre client userland tools (client for unpatched vendor kernel)

• lustre-client-modules: Lustre client modules (client for unpatched vendor kernel)

The same Lustre version which is used for the Lustre servers is used for the Lustre clients. The kernel version of the lustre-client-modules package must also match that of the kernel used. It is 2.6.18_164.11.1.el5 in the following example:

Example

[root@mycluster ~]# ls lustre-client-modules*
lustre-client-modules-2.0.0.1-2.6.18_164.11.1.el5_lustre.2.0.0.1.x86_64.rpm
[root@mycluster ~]# ls /cm/images/lustre-client-image/boot/vmlinuz*
/cm/images/lustre-client-image/boot/vmlinuz-2.6.18_164.11.1.el5

The installation can then be carried out:

Example

[root@mycluster ~]# rpm --root /cm/images/lustre-client-image -ivh \
lustre-client-2.0.0.1-2.6.18_164.11.1.el5_lustre.2.0.0.1.x86_64.rpm \
lustre-client-modules-2.0.0.1-2.6.18_164.11.1.el5_lustre.2.0.0.1.x86_64.rpm

Creating The Lustre Client Image: Method 2

This method describes how to create a Lustre client image with a Lustre kernel package.

To create a starting-point image for the Lustre client image, a clone is made of an existing software image, for example from default-image. A clone software image is created via cmgui (Figure 12.1), or using cmsh on the head node:

Example

[root@mycluster ~]# cmsh
[mycluster]% softwareima
he Settings tab of the node001 resource. Figure 3.5 displays properties, such as the hostname, that can be changed. The Save button on the bottom of the tab makes the changes active and permanent, while the Revert button undoes all unsaved changes.

Figure 3.6: Node Tasks

Figure 3.6 shows the Tasks tab of the node001 resource. The tab displays operations that can be performed on the node001 resource. Details on setting these up, their use, and meaning, are provided in the remaining chapters of this manual.

It is also possible to select a resource folder (rather than a resource i
he basic example of Section 10.1, a threshold is set to 50% of CPUUser, and an action is set so that crossing this threshold runs the killallyes script.

A threshold is a particular value in a sampled metric. A sample can cross the threshold, thereby entering or leaving a zone that is demarcated by the threshold.

A threshold can be configured to launch an action (Section 10.2.2) according to threshold-crossing conditions. cmgui's New Threshold dialog (Figure 10.3) has three action launch configuration options:

1. Enter: if the sample has entered into the zone and the previous sample was not in the zone

2. Leave: if the sample has left the zone and the previous sample was in the zone

3. During: if the sample is in the zone, and the previous sample was also in the zone

A threshold zone also has a settable severity (Section 10.2.6) associated with it. This value is processed for the AlertLevel metric (Section 10.2.7) when an action is triggered by a threshold event.

10.2.4 Health Check

A health check value is a state. It is the response to running a health check script at a regular time interval, with as possible response values PASS, FAIL, or UNKNOWN. The state is recorded in the monitoring framework.

Examples of health checks are:

• checking if the hard drive still has enough space left on it, and returning PASS if it has

• checking if an NFS mount is accessible, and retu
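A minimal script for the first example might look like the sketch below: it prints PASS while filesystem usage is under an arbitrary 90% threshold, and FAIL otherwise. The function name and threshold are inventions for illustration; real Bright health check scripts follow their own packaging conventions.

```shell
# diskspace_check: respond PASS if the given filesystem (default /) is
# less than 90% full, FAIL otherwise.
diskspace_check() {
    local used
    # df -P prints "Use%" in column 5 of the second output line
    used=$(df -P "${1:-/}" | awk 'NR==2 { gsub("%", "", $5); print $5 }')
    if [ "$used" -lt 90 ]; then
        echo PASS
    else
        echo FAIL
    fi
}
```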
he behavior of the role. For example, the SGEClient role, which turns a node into a Sun Grid Engine client, uses parameters to control how the node is configured within SGE in terms of queues and the number of queue slots.

When a role has been assigned to a node category with a certain set of parameters, it is possible to override the parameters for a node inside the category. This can be done by assigning the role again to the individual node, with a different set of parameters. Roles that have been assigned to nodes override roles that have been assigned to a node category.

3.2 Modules Environment

The modules environment allows users to modify their shell environment using pre-defined modules. A module may, for example, configure the user's shell to run a certain version of an application.

Details of the modules environment from a user perspective are discussed in the User Manual. However, some aspects of it are relevant for administrators and are therefore discussed here.

3.2.1 Adding And Removing Modules

Modules may be loaded and unloaded, and also be combined for greater flexibility.

Modules that are currently installed are displayed by running:

module list

The modules available for loading are displayed by running:

module avail

Loading and removing specific modules is done with module load and module remove, using this format:

module add <MODULENAME1> [<MODULENAME2> ...]

Example

Here is how to load the environ
he console of the node before it loads the kernel and ramdisk (Figure 6.14). This only affects the current boot. By default, the PXE menu install mode option is set to AUTO.

Figure 6.14: PXE Menu With Install Mode Set To AUTO

3. The "Next boot install-mode" property of the node configuration. This can be set using cmgui (Figure 6.15):

Figure 6.15: cmgui Install Mode Settings For The Node

It can also be set using cmsh with a one-liner like:

cmsh -c "device use node001; set nextinstallmode FULL; commit"

The property is cleared when the node starts up again, after the node-installer finishes its installation tasks. So it is empty unless specifically set by the administrator during the current uptime for the node.

4. The install-mode property set in the node configuration. This can be set using cmgui (Figure 6.15), or using cmsh with a one-liner
he current node instead, then no partition checking or filesystem checking is done by the node-installer.

If the install-mode value of NOSYNC applies, then, if the partition and filesystem checks show no errors, the node starts up without getting an image synced to it from the provisioning node. If the partition and filesystem checks show errors, then the node does get a known good image synced across.

The node-installer is capable of creating advanced drive layouts, including software RAID and LVM setups. Some drive layout examples, including documentation, are given in Appendix D.

6.3.7 Synchronizing The Local Drive With The Software Image

After having mounted the local filesystems, these can be synchronized with the contents of the software image associated with the node (through its category). Synchronization is skipped if NOSYNC is set, and takes place if install-mode values of FULL or AUTO are set. Synchronization is delegated by the node-installer to the CMDaemon provisioning system. The node-installer just sends a provisioning request to CMDaemon on the head node.

For an install-mode of FULL, or for an install-mode of AUTO where the local filesystem is detected as being corrupted, full provisioning is done. For an install-mode of AUTO where the local filesystem is healthy and agrees with that of the software image, sync provisioning is done.

On receiving the provisioning request, CMDaemon assigns the provisioning task to one of t
he local drive of a provisioning node must therefore have enough space available for these images, which may require changes in its disk layout.

6.1.2 Provisioning Nodes: Role Setup With cmsh

In the following cmsh example, the administrator creates a new category called misc. The default category slave already exists in a newly installed cluster.

The administrator then assigns the role called provisioning, from the list of assignable roles, to nodes in the misc category. (As an aside from the topic of provisioning: from an organizational perspective, other assignable roles include monitoring, login, and failover.)

The nodes in the misc category assigned the provisioning role then have default-image set as the image that they provision to other nodes, and have 20 set as the maximum number of other nodes to be provisioned simultaneously. (Some text is elided in the following example.)

Example

[mycluster]% category add misc
[mycluster->category*[misc*]]% roles
[mycluster->category*[misc*]->roles]% assign provisioning
[mycl...->roles*[provisioning*]]% set allimages false
[mycl...->roles*[provisioning*]]% set images default-image
[mycl...->roles*[provisioning*]]% set maxprovisioningnodes 20
[mycl...->roles*[provisioning*]]% show
Parameter              Value
---------------------- ----------------------
Name                   provisioning
Type                   ProvisioningRole
allImages              no
images                 default-image
maxProvisioningNodes   20
nodegroups             m
he nodes that are assigned the compute role. This accepts, manages, and returns the results of jobs on the compute nodes. It writes logs to the /cm/local/apps/torque/current/spool/mom_logs directory on the compute nodes.

Jobs will, however, not be executed unless the scheduler daemon is also running. This typically runs on the head node, and schedules jobs for compute nodes according to criteria set by the administrator. The possible scheduler daemons for Torque are:

• pbs_sched, if Torque's built-in scheduler itself is used. It writes logs to the /cm/shared/apps/torque/current/spool/sched_logs directory.

• maui, if the Maui scheduler is used. It writes logs to /cm/shared/apps/maui/current/spool/log.

• moab, if the Moab scheduler is used. It writes logs to /cm/shared/apps/moab/current/spool/log.

8.4.3 PBS Pro

Installation, Initialization And Configuration

PBS Pro Installation

PBS Pro can be selected for installation during Bright Cluster Manager 5.1 installation, at the point when a workload manager must be selected (Figure 2.17). It can also be installed later on, when the cluster is already set up. In either case, it is offered under a 90-day trial license.

To install and initialize PBS Pro after Bright Cluster Manager has already been set up without PBS Pro, the following script must be run with the -q flag: /cm/shared/apps/pb
the provisioning nodes. The node-installer is notified when image synchronization starts, and also when the image synchronization task ends, whether it is completed successfully or not.

Exclude Lists: excludelistsyncinstall And excludelistfullinstall
Image synchronization is done using rsync. What files are synchronized is decided by an exclude list. An exclude list is a property of the node category, and is a list of directories and files that are excluded from consideration during synchronization. The exclude list that is passed on to rsync is decided by the type of synchronization chosen: full or sync.

- A sync type of synchronization uses the excludelistsyncinstall property to specify what files and directories to exclude from consideration when copying over the rest of the filesystem from the known good image. This list has sections of the filesystem that should be retained across boots, such as log files. On the node that is being copied to, the remaining files and directories, which undergo synchronization, lose their original contents.

- A full type of synchronization uses the excludelistfullinstall property to specify what files and directories to exclude from consideration when copying over parts of the file system from a known good image. This is a small list of exclusions, containing items such as /proc. The default list allows a full filesystem to be copied over to
the shell, through common operators such as >, >> and |.

Example

[mycluster]% device list > devices
[mycluster]% device status >> devices
[mycluster]% device list | grep node001
Type         Hostname (key)   MAC (key)           Ip
SlaveNode    node001          00:E0:81:2E:F7:96   10.142.0.1
[mycluster]%

Looping Over Objects With foreach
It is frequently convenient to be able to execute a cmsh command on several objects at once. The foreach command is available in a number of cmsh modes for this purpose. A foreach command takes a list of space-separated object names (the keys of the object) and a list of commands that must be enclosed by parentheses, i.e. "(" and ")". The foreach command will then iterate through the objects, executing the list of commands on each iterated object. The foreach syntax is:

foreach <object1> <object2> ... ( <command1>; <command2> ... )

Example

[mycluster->device]% foreach node001 node002 (get hostname; status)
node001
node001 ............. [ UP ]
node002
node002 ............. [ UP ]
[mycluster->device]%

With the foreach command it is possible to perform set commands on groups of objects simultaneously, or to perform an operation on a group of objects. For extra convenience, device mode in cmsh supports a number of additional flags (-n, -g and -c) which can be used for selecting devices. Instead of passing a list of
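Conceptually, foreach behaves like an ordinary shell for loop over object names: the same commands run once per object. The bash analogy below is a self-contained sketch; the node names are illustrative and the echo lines merely stand in for the cmsh commands.

```shell
# A plain bash analogy to cmsh's foreach: iterate over a space-separated
# list of object names, running the same commands for each one.
out=$(
    for node in node001 node002; do
        echo "$node"                  # stands in for: get hostname
        echo "$node ..... [ UP ]"     # stands in for: status
    done
)
echo "$out"
```

The loop produces one pair of output lines per object, just as the foreach example above prints the hostname and status for each node in turn.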
when the action is run. It is selected from a drop-down choice of Enter, During or Leave, where:

- Enter runs the action if the sample has entered the zone
- Leave runs the action if the sample has left the zone
- During runs the action if the sample is in the zone, and the previous sample was also in the zone

Metric Configuration: Consolidators Options
The Metric Configuration tab of figure 10.16 also has a Consolidators button associated with the selected metric.
Consolidators decide how the data values are handled once the initial log length quantity for a metric is exceeded. Data points that have become old are gathered and, when enough have been gathered, they are processed into consolidated data. Consolidated data values present fewer data values than the original raw data values over the same time duration. The aim of consolidation is to increase performance, save space, and keep the basic information still useful when viewing historical data.
The Consolidators button opens a window that displays a list of consolidators that have been defined for the selected metric (figure 10.20).

Figure 10.20: cmgui Metric Configuration Consolidators Display (showing consolidators such as Day and Week for the CPUUser metric, with intervals of 86400 and 604800 seconds and kind Average)
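The core of consolidation is simple averaging per interval. The awk sketch below illustrates the idea on made-up (timestamp, value) samples, using an hour-long (3600 s) interval and the Average kind; it is a conceptual illustration only, not CMDaemon's actual consolidation code.

```shell
# Sketch of consolidation: average raw "timestamp value" samples into
# one consolidated data point per 3600-second interval.
samples='0 10
1200 20
2400 30
3600 40
4800 50'
consolidated=$(echo "$samples" | awk '{
    bucket = int($1 / 3600)            # which interval the sample falls in
    sum[bucket] += $2; n[bucket]++
} END {
    for (b in sum) printf "%d %.1f\n", b * 3600, sum[b] / n[b]
}' | sort -n)
echo "$consolidated"
```

Five raw samples collapse into two consolidated points: the first hour averages to 20.0 and the second to 45.0, showing how consolidation trades detail for a compact long-term history.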
Figure 8.7: Workload Management Queues Edit Dialog For SGE

Name    Type        Route   Minimum wallclock   Maximal wallclock
workq   Execution           00:00:00            23:59:59

Figure 8.8: Workload Management Queues Edit Dialog For Torque And PBS Pro

In the edit dialog, the generic names Minimum wallclock and Maximum wallclock correspond respectively to the soft and hard wall times allowed for the jobs in the queue. Specifically, these are s_rt and h_rt in SGE, or resources_default.walltime and resources_max.walltime in Torque and PBS Pro.
The Prolog and Epilog files that can be specified in the dialog are scripts run before and after the job is executed. However, for SGE, a default global Prolog configuration is used by Bright Cluster Manager if there is no local script in place. The global configuration ensures that Bright Cluster Manager healthcheck scripts flagged as prejob scripts (section 10.4.3) run as part of SGE's Prolog script. Administrators creating their own Prolog file may wish to refer to the global Prolog script (cm/prolog under $SGE_ROOT), and in particular how it hooks into Bright Cluster Manager prejob checks with a call to cmprejobcheck.
The Prolog and Epilog scripts for Torque and PBS Pro are set up for the node images, and their path ca
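A site Prolog in the spirit described above can be sketched as follows. This is a hedged skeleton, not Bright's shipped script: the bare cmprejobcheck call is only a placeholder named after the hook mentioned in the text, and the real invocation and its arguments should be taken from the global prolog script (cm/prolog under $SGE_ROOT).

```shell
# Hedged skeleton of a site Prolog: run a prejob health check first,
# and only let the job start if the check passes.
cat > /tmp/site-prolog.sh <<'EOF'
#!/bin/bash
if command -v cmprejobcheck >/dev/null 2>&1; then
    cmprejobcheck || exit 1   # a FAIL response keeps the job from starting
fi
echo "prolog: node healthy, starting job"
EOF
chmod +x /tmp/site-prolog.sh
out=$(/tmp/site-prolog.sh)
echo "$out"
```

On a machine without the health check installed, the skeleton simply lets the job proceed; on a cluster node, a failing check would make the Prolog exit non-zero before the job runs.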
possibility of such damages, unless such copyright owner or third party has signed a writing to the contrary.

Table of Contents

0.1 Quickstart
0.2 About This Manual
0.3 Getting Administrator-Level Support

1 Introduction
1.1 What Is Bright Cluster Manager?
1.2 Cluster Structure
1.3 Bright Cluster Manager Administrator And User Environment
1.4 Organization Of This Manual

2 Installing Bright Cluster Manager
2.1 Minimal Hardware Requirements
2.2 Supported Hardware
2.3 Head Node Installation

3 Cluster Management with Bright Cluster Manager
3.1 Concepts
3.2 Modules Environment
3.3 Authentication
3.4 Cluster Management GUI
3.5 Navigating the Cluster Management GUI
3.6 Cluster Management Shell
3.7 Cluster Management Daemon

4 Configuring The Cluster
4.1 Installing a License
4.2 Network Settings
4.3 Configuring IPMI Interfaces
4.4 Configuring InfiniBand Interfaces
4.5 Configuring Switches and PDUs

5 Power Management
5.1 Configuring Power P
will drop into the context of an already existing object, i.e. it will use the object.
The set command sets the value of each parameter displayed by a show command.

Example

[myheadnode->monitoring->actions[killallyes]]% set description "kill all yes processes"

The clear command is the converse of set, and removes any value for a given parameter.

Example

[myheadnode->monitoring->actions[killallyes]]% clear command

The validate command checks if the object has all required values set to sensible values. The commands refresh, modified and commit work as expected from the introduction to working with objects (section 3.6.3). So, for example, commit will only succeed if the killallyes object passes validation.

Example

[myheadnode->monitoring->actions[killallyes]]% validate
Code   Field     Message
4      command   command should not be empty

Here validation fails because the parameter command has no value set for it yet. This is remedied with set acting on the parameter (some prompt text elided for display purposes):

Example

[...]% set command "/cm/local/apps/cmd/scripts/actions/killallyes"
[...]% commit

Validation then succeeds, and the commit successfully saves the killallyes object.
Note that validation does not check if the script itself exists. It solely does a sanity check on the values of the parameters of the object, which is another issue. If the killallyes script does not yet exist i
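The manual never shows the body of the killallyes script itself, so the following is a minimal sketch of what such an action script might contain; the actual script under /cm/local/apps/cmd/scripts/actions/ may differ. It simply kills every process whose name is exactly "yes".

```shell
# A minimal sketch of a killallyes action script, plus a demonstration.
cat > /tmp/killallyes <<'EOF'
#!/bin/bash
pkill -9 -x yes || true   # succeed even when no "yes" process exists
EOF
chmod +x /tmp/killallyes

yes > /dev/null 2>&1 &    # start a runaway "yes" process to act on
sleep 1
/tmp/killallyes
sleep 1
left=$(pgrep -x yes || echo none)
echo "$left"
```

The `|| true` matters: an action script should exit successfully even when there is nothing to kill, so that CMDaemon does not treat an idle node as an action failure.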
will revert and commit changes; use will use the specified health check, making it the default for commands; and validate applied to the health check will check if the health check object has sensible values. The append and removefrom commands will append an action to, and remove an action from, a specified health check action parameter. They correspond to the + and - widgets of cmgui in figure 10.23, and work with parameters that can have multiple values.
The action killallyes was set up to be carried out with the metric CPUUser in the basic example of section 10.1. The action can also be carried out with a FAIL response for the cpucheck health check, by using the append command:

Example

[...->healthconf[cpucheck]]% append failactions killallyes
[...->healthconf[cpucheck]]%

Sending an e-mail to root can be done by appending further:

Example

[...->healthconf[cpucheck]]% append failactions "sendemail root"
[...->healthconf[cpucheck]]% get failactions
enter: SendEmail root
enter: killallyes
[...->healthconf[cpucheck]]%

11 Day-to-day Administration
This chapter discusses several tasks that may come up in day-to-day administration of a cluster running Bright Cluster Manager.

11.1 Parallel Shell
The parallel shell allows bash commands to be run on a group of nodes simultaneously.

- In cmsh it is run from device mode by using the pexec command.

Example

[bright51->device]%
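A parallel shell is conceptually just "launch the same command for every node concurrently, then collect the output". The sketch below simulates that with local background jobs so it is self-contained; real pexec talks to actual nodes over the network, and the node names here are illustrative.

```shell
# Conceptual sketch of a parallel shell run across a group of "nodes".
nodes="node001 node002 node003"
tmp=$(mktemp -d)
for n in $nodes; do
    ( echo "[$n] kernel: $(uname -r)" > "$tmp/$n" ) &   # one job per node
done
wait                      # block until every per-node job has finished
out=$(for n in $nodes; do cat "$tmp/$n"; done)
echo "$out"
rm -rf "$tmp"
```

Writing each node's output to its own file and printing afterwards keeps the collected output ordered and unmangled, even though the jobs themselves finish in arbitrary order.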
image from the base archive.

9.5.1 Creating A Base Distribution Archive From A Base Host
The step of creating the base distribution archive is done by creating an archive structure containing the files that are needed by the non-head node.
The archive can be a convenient and standard tar.gz file archive, or, actually taking the step a little further towards the end result, the archive can be a fully expanded archive file tree.
For example, a base distribution tar.gz archive (here it is /cm/images/new-image/grab.tgz) can be created from the base host basehost64 as follows:

ssh root@basehost64 \
'tar -cz \
--exclude /etc/HOSTNAME --exclude /etc/localtime \
--exclude /proc --exclude /lost+found --exclude /sys \
--exclude /root/.ssh --exclude /var/lib/dhcpcd/* \
--exclude /media/floppy --exclude /etc/motd \
--exclude /root/.bash_history --exclude /root/CHANGES \
--exclude /etc/udev/rules.d/30-net_persistent_names.rules \
--exclude /var/spool/mail --exclude /rhn \
--exclude /etc/sysconfig/rhn/systemid \
--exclude /var/spool/up2date \
--exclude /etc/sysconfig/rhn/systemid.save \
--exclude /root/mbox --exclude /var/cache/yum \
--exclude /etc/cron.daily/rhn-updates /' \
> /cm/images/new-image/grab.tgz

Or, alternatively, a fully expanded archive file tree can be created from basehost64 by rsyncing to an existing directory (here it is /cm/images/new-image):

rsync -av
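The archive step above can be rehearsed safely on a toy tree before pointing it at a real base host. In this self-contained sketch all paths and file contents are made up; it just shows that excluded paths never reach the archive.

```shell
# Small-scale rehearsal: tar up a toy root tree while excluding
# volatile paths, then list what actually landed in the archive.
set -e
root=$(mktemp -d)
mkdir -p "$root/etc" "$root/proc" "$root/var/cache/yum"
echo conf > "$root/etc/app.conf"
echo junk > "$root/var/cache/yum/junk"

tar -C "$root" -cz --exclude './proc' --exclude './var/cache/yum' \
    -f /tmp/grab.tgz .

contents=$(tar -tzf /tmp/grab.tgz)
echo "$contents"
rm -rf "$root"
```

Listing the archive with tar -tzf before deploying it is a cheap sanity check that nothing volatile (pseudo-filesystems, caches, credentials) was swept in by accident.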
in both PFX and PEM format, in the following locations:

/root/.cm/cmgui/admin.pfx
/root/.cm/cmsh/admin.pem
/root/.cm/cmsh/admin.key

The administrator password provided during Bright Cluster Manager installation encrypts the admin.pfx file generated as part of the installation. The same password is also used as the initial root password of all nodes, as well as for the other passwords discussed in section 3.3.1.
The GUI utility cmgui (section 3.4) connects to the head node if the user types in the password to the admin.pfx file. If the root login password to the head node is changed, typically by typing the unix passwd command in the root shell of the node, then the administrator PFX password remains unchanged unless it too is changed explicitly.
The password of the PFX file can be changed with the passwdpfx utility. This is besides the cm-change-passwd utility discussed in section 3.3.1. The passwdpfx utility is part of cmd, a module that includes CMDaemon and associated utilities (section 3.7):

[root@mycluster ~]# module load cmd
[root@mycluster ~]# passwdpfx
Enter old password:
Enter new password:
Verify new password:
Password updated
[root@mycluster ~]#

If the admin.pfx password is forgotten, then a new admin.pfx certificate can be created using a CMDaemon option:

[root@mycluster ~]# service cmd stop
[root@mycluster ~]# cmd -c secretpab5word
[root@mycluster ~]# service cmd start
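Mechanically, changing a PFX password amounts to unpacking the container with the old password and repacking it with the new one. The openssl sketch below illustrates only that mechanism with throwaway files and passwords; it is not what passwdpfx literally runs, and passwdpfx itself should be used for the real admin.pfx file.

```shell
# What a PFX password change amounts to, sketched with plain openssl.
set -e
cd "$(mktemp -d)"

# A throwaway key + self-signed certificate, packed into a PFX file
openssl req -x509 -newkey rsa:2048 -keyout key.pem -out cert.pem \
        -days 1 -nodes -subj "/CN=demo" 2>/dev/null
openssl pkcs12 -export -inkey key.pem -in cert.pem \
        -out admin.pfx -passout pass:oldpass

# "Change" the password: unpack with the old one, repack with the new one
openssl pkcs12 -in admin.pfx -passin pass:oldpass -nodes -out unpacked.pem
openssl pkcs12 -export -in unpacked.pem -out admin-new.pfx \
        -passout pass:newpass

# The new PFX opens with the new password
result=$(openssl pkcs12 -in admin-new.pfx -passin pass:newpass -noout \
         2>/dev/null && echo OK)
echo "$result"
```

Note that the intermediate unpacked.pem holds the private key unencrypted, which is why a real tool does this in a protected location and cleans up afterwards.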
In semi-diskless mode, the node-installer will always use excludelistfullinstall when synchronizing the software image to memory and disk.

<?xml version="1.0" encoding="UTF-8"?>
<diskSetup xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
           xsi:noNamespaceSchemaLocation="schema.xsd">
  <diskless maxMemSize="0"></diskless>
  <device>
    <blockdev>/dev/sda</blockdev>
    <partition id="a1">
      <size>max</size>
      <type>linux</type>
      <filesystem>ext3</filesystem>
      <mountPoint>/local</mountPoint>
      <mountOptions>defaults,noatime,nodiratime</mountOptions>
    </partition>
  </device>
</diskSetup>

Example initialize And finalize Scripts
The node-installer executes any initialize and finalize scripts at particular stages of its 13-step run during node-provisioning (section 6.3). These are sometimes useful when building a cluster, for carrying out troubleshooting or workarounds with non-standard hardware for these stages. The scripts are stored in the CMDaemon database, rather than in the filesystem as plain text files, because they run before the init process for the node takes over. They are accessible for viewing and editing:

- In cmgui, using the Node Categories or Slave Nodes resource, under the Settings tabbed pane for the selected i
in the slave category. This means that roles assigned to the slave category are automatically assigned to all non-head nodes, unless, by way of exception, a node is individually configured to use its own role setting instead.
Setting the role in Node Categories for the category slave is done by clicking on the Node Categories folder, selecting the slave category, and selecting the Roles tab. The appropriate workload manager client role is then configured (figure 8.2).

Figure 8.2: Workload Manager Role Assignment By Category For Compute Nodes

Each workload manager client role has options that can be set for Slots and Queues. Slots, from SGE terminology, corresponds in Bright Cluster Manager to the np setting in Torque and PBS Pro terminology, and is normally set to the number of cores per node. Queues with a specified name are available in th
252. ing uptime Collecting CMDaemon configuration Collecting node installer configuration Collecting CMDaemon database backups Collecting RPM database information for image default image Collecting base installation release Collecting CM version Collecting core trace information Collecting process information Collecting RPM database information Collecting CMDaemon log Collecting node installer log Collecting system log files log Collecting mysql log file Collecting mce log file Collecting workload management log files Collecting filesystem mount information Collecting process information Collecting license information Preparing diagnostics file Diagnostics saved in root cm cm diagnose_bright51 cm cluster 000007_27 4 11_12152 tar gz Submit diagnostics to http support brightcomputing com cm diagnose Y n Uploaded file cm diagnose_bright51 cm cluster 000007_27 4 11_12152 tar gz root bright51 Requesting Remote Support With request remote assistance The request remote assistance utility allows a Bright Computing en gineer to securely tunnel into the cluster without a change in firewall or ssh settings of the cluster For the utility to work it should be allowed to access the www and ssh ports of Bright Computing s internet servers The tool is run by the cluster administrator Example root bright51 request remote assistance This tool helps securely set up a temporary ssh tunnel to Bright Computing Inc
253. interface for example by ethtool but TCP IP packets may not be detected for example by wireshark In that case the manufacturer should be contacted to upgrade the driver e The interface may have a hardware failure In that case the interface should be replaced 6 7 2 Node installer Logging If the node manages to get beyond the PXE stage to the node installer stage then the first place to look for hints on node boot failure is usu ally the node installer log file The node installer sends logging output to syslog In a default Bright Cluster Manager syslog setup the messages end up in var log node installer on the head node Optionally ex tra log information can be written by enabling debug logging To enable debug logging change the debug field in the node installer configuration file cm node installer scripts node installer conf From the console of the booting node the log file is also accessible by pressing Alt F7 on the keyboard 6 7 3 Provisioning Logging The provisioning system sends log information to the CMDaemon log file By default this is in var log cmdaemon The image synchronization log file can be retrieved with the synclog command running from device mode in cmsh Section 6 3 7 Hints on provisioning problems are often found by looking at the tail end of the log 6 7 4 Ramdisk Cannot Start Network The ramdisk must activate the node s network interface in order to fetch the node installer To activate the ne
The logic of the scenarios means that an unpreconfigured node always boots to a dialog loop, requiring manual intervention during a first install (scenarios 2 and 3). For subsequent boots the behavior is:

- If the node MAC hardware has changed (scenarios 1, 2, 3):
  - if the node is new and the detected port has a configuration, the node automatically boots to that configuration (scenario 1)
  - else, manual intervention is needed (scenarios 2, 3)
- If the node MAC hardware has not changed (scenarios 4, 5, 6, 7):
  - if there is no port mismatch, the node automatically boots to its last configuration (scenarios 4, 7)
  - else, manual intervention is needed (scenarios 5, 6)

The newnodes Command
New nodes that have not been configured yet can be detected using the newnodes command from within device mode in cmsh:

Example

[bright51->device]% newnodes
The following nodes (in order of appearance) are waiting to be assigned:
MAC                First appeared                  Detected on switch port
00:0C:29:01:0F:F8  Mon, 14 Feb 2011 10:16:00 CET   [no port detected]
[bright51->device]%

These nodes can be uniquely identified by their MAC address or switch port address. The port and switch to which a particular MAC address is connected can be discovered by using the showport command (section 4.5.4). After confirming that they are appropriate, the ethernetswitch property for the specified device can be set to the port and switch v
RandomSeedFile directive
Syntax: RandomSeedFile = filename
Default: RandomSeedFile = /dev/urandom
The RandomSeedFile directive specifies the path to a source of randomness.

DHParamFile directive
Syntax: DHParamFile = filename
Default: DHParamFile = /cm/local/apps/cmd/etc/dh1024.pem
The DHParamFile directive specifies the path to the Diffie-Hellman parameters.

SSLHandshakeTimeout directive
Syntax: SSLHandshakeTimeout = number
Default: SSLHandshakeTimeout = 10
The SSLHandshakeTimeout directive controls the time-out period, in seconds, for SSL handshakes.

SSLSessionCacheExpirationTime directive
Syntax: SSLSessionCacheExpirationTime = number
Default: SSLSessionCacheExpirationTime = 300
The SSLSessionCacheExpirationTime directive controls the period, in seconds, for which SSL sessions are cached. Specifying the value 0 can be used to disable SSL session caching.

DBHost directive
Syntax: DBHost = hostname
Default: DBHost = localhost
The DBHost directive specifies the hostname of the MySQL database server.

DBPort directive
Syntax: DBPort = number
Default: DBPort = 3306
The DBPort directive specifies the TCP port of the MySQL database server.

DBUser directive
Syntax: DBUser = username
Default: DBUser = cmdaemon
The DBUser directive specifies the username that will be used to connect to the MySQL database server.

DBPass directive
Syntax: DBPass = password
Default: DBPass
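Collected in one place, the defaults above would appear in the CMDaemon configuration file (/cm/local/apps/cmd/etc/cmd.conf) along the following lines. This is a hedged sketch: the exact quoting and ordering should be checked against the cmd.conf file that actually ships with the cluster.

```
RandomSeedFile = "/dev/urandom"
DHParamFile = "/cm/local/apps/cmd/etc/dh1024.pem"
SSLHandshakeTimeout = 10
SSLSessionCacheExpirationTime = 300
DBHost = "localhost"
DBPort = 3306
```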
is loaded:

<?xml version="1.0" encoding="UTF-8"?>
<diskSetup xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
           xsi:noNamespaceSchemaLocation="schema.xsd">
  <device>
    <blockdev>/dev/sda</blockdev>
    <partition id="a1">
      <size>25G</size>
      <type>linux lvm</type>
    </partition>
  </device>
  <device>
    <blockdev>/dev/sdb</blockdev>
    <partition id="b1">
      <size>25G</size>
      <type>linux lvm</type>
    </partition>
  </device>
  <volumeGroup>
    <name>vg1</name>
    <extentSize>4M</extentSize>
    <physicalVolumes>
      <member>a1</member>
      <member>b1</member>
    </physicalVolumes>
    <logicalVolumes>
      <volume>
        <name>vol1</name>
        <size>35G</size>
        <filesystem>ext3</filesystem>
        <mountPoint>/</mountPoint>
        <mountOptions>defaults,noatime,nodiratime</mountOptions>
      </volume>
      <volume>
        <name>vol2</name>
        <size>max</size>
        <filesystem>ext3</filesystem>
        <mountPoint>/tmp</mountPoint>
        <mountOptions>defaults,noatime,nodiratime</mountOptions>
      </volume>
    </logicalVolumes>
  </volumeGroup>
</diskSetup>

D.7 Example: Diskless
This example shows how nodes can be configur
is not even starting to PXE boot in the first place:

- There may be a bad cable connection. This can be due to moving the machine, or heat creep, or another physical connection problem. Firmly inserting the cable into its slot may help. Replacing the cable or interface, as appropriate, may be required.
- The cable may be connected to the wrong interface. By default, eth0 is assigned the internal network interface, and eth1 the external network interface. However:
  - The two interfaces can be confused when physically viewing them, and a connection to the wrong interface can therefore be made.
  - It is also possible that the administrator has changed the default assignment.
  The connections should be checked to eliminate these possibilities.
- DHCP may not be running. A check should be done to confirm that DHCP is running on the internal network interface (usually eth0):

[root@testbox ~]# ps aux | grep dhcp
root 4368 0.0 0.0 27680 3484 ? Ss Apr07 0:01 /usr/sbin/dhcpd eth0

- A rogue DHCP server may be running. If there are all sorts of other machines on the network that the nodes are on, then it is possible that there is a rogue DHCP server active on it, interfering with PXE booting. Stray machines should be eliminated.
- Sometimes a manufacturer releases hardware with buggy drivers that have a variety of problems. For instance: Ethernet frames may be detected at the
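The DHCP check above can be scripted for quick triage on the head node. The process name dhcpd is taken from the ps output shown above; whether it should be running on the host where the script runs is, of course, site-specific.

```shell
# Sketch of a quick PXE-stage triage check based on the text above.
check_dhcpd() {
    if pgrep -x dhcpd >/dev/null 2>&1; then
        echo "dhcpd: running"
    else
        echo "dhcpd: NOT running -- nodes cannot PXE boot from this host"
    fi
}
msg=$(check_dhcpd)
echo "$msg"
```

Using pgrep -x avoids the classic pitfall of `ps aux | grep dhcp` matching the grep process itself.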
258. is used to create a met ric collection Figure 10 26 A metric collection is a special metric script with the following properties e It is able to return several metrics of different types when it is run not just one metric of one type like a normal metric script does hence the name metric collection e It autodetects if its associated metrics are able to run and to what extent and presents the metrics accordingly For example if the metric collection is run on a node which only has 3 CPUs running rather than a default of 4 it detects that and presents the results for just the 3 CPUs Further details on metric collections scripts are given in appendix I Because handling metric collections is just a special case of handling a metric the Add Collection button dialog is merely a restricted version of the Add button dialog Setting up a metric collection is therefore sim plified by having most of the metric fields pre filled and kept hidden For example the Class field for a metric collection would have the value Prototype in the Add button dialog while this value is pre filled and in visible in the Add Collection dialog A metric collection can be created with the Add dialog but it would be a little more laborious Whatever the method used to create the metric collection it can al ways be edited with the Edit button just like any other metric Viewing visualizations of a metric collection in cmgui is only possi ble th
switch port associated with it. This is not considered a port mismatch, but an unset switch port configuration, and it typically occurs if switch port configuration has not been carried out, whether by mistake or deliberately. The node-installer displays the configuration as a suggestion, along with a confirmation dialog (figure 6.11). The suggestion can be interrupted, and other node configurations can be selected manually instead, using a sub-dialog. By default, in the main dialog, the configuration is accepted after a timeout.

Figure 6.11: Scenarios: Port Unset Dialog

A truth table summarizing the scenarios is helpful:

Scenario | Node port detected | Switch port configuration found | MAC config known | Switch port configuration conflicts with node configuration
1        | No                 | Yes                             | Yes              | No
2        | No                 | Yes                             | No               | No
3        | No                 | No                              | No               | No
4        | Yes                | Yes                             | Yes              | No
5        | Yes                | Yes                             | Yes              | Yes (configurations differ)
6        | Yes                | No                              | Yes              | Yes (port expected by MAC configuration not found)
7        | Yes                | Yes                             | No               | No (port not expected by MAC configuration)

In these scenarios, whenever the user manually selects a node configuration in the prompt dialog, an attempt to detect an Ethernet switch port is repeated. If a port mismatch still occurs, it is handled by the system as if the user has not made a selection.

Summary Of Behavior During Hardware Changes
The logic of the scenar
either using tab to see the possible completions, or following it up with the enter key, will suggest several parameters that can be set, one of which is password:

Example

[mycluster->user[maureen]]% set
Usage: set <parameter> <value> [<value> ...]
Set value(s) of the specified parameter from the current user
commonname ................ Full user name
groupid ................... Base group of this user
homedirectory ............. Home directory
loginshell ................ Login shell
password .................. Password
userid .................... User id number
username .................. User name
[mycluster->user[maureen]]%

Continuing the session from the end of section 7.2.2, the password can be set at the user context prompt like this:

Example

[mycluster->user[maureen]]% set password seteca5trOnOmy
[mycluster->user[maureen]]% commit
[mycluster->user[maureen]]%

At this point, the account maureen is finally ready for use.
The converse of the set command is the clear command, which clears properties:

Example

[mycluster->user[maureen]]% clear password; commit

Editing Groups With append And removefrom
While the above commands, set and clear, also work with groups, there are two other commands available which suit the special nature of groups. These supplementary commands are append and removefrom. They are used to add extra users to, and remove extra users from, a group. Fo
drive: /dev/hdc (VMware Virtual IDE CDROM Drive)

Figure 2.16: DVD Selection

Workload Management Configuration
The Workload Management configuration screen (figure 2.17) allows selection from a list of supported workload managers. A workload management system is highly recommended to run multiple compute jobs on a cluster.
To prevent a workload management system from being set up, select None. If a workload management system is selected, then the number of slots per node can be set; otherwise the slots setting is ignored. If no changes are made, then the number of slots defaults to the CPU count on the head node.
The head node can also be selected for use as a compute node, which can be a sensible choice on small clusters. The setting is ignored if no workload management system is selected.
Clicking Continue on this screen leads to the Disk Partitioning and Layouts screen, described next.

Figure 2.17: Workload Management Setup

Disk Partitioning and Layouts
The Disk Partitioning and Layouts configuration screen (figure 2.18) consists of two options: Head node disk layout and Node disk layout. For each option, a partitioning layout other than the default ca
Figure 8.14: Configuring A Prejob Healthcheck Via cmgui

Configuration Using cmsh
To configure a prejob health check with cmsh, the healthconf submode (section 10.7.4) is entered, and the prejob health script object used. In the following example, where some text has been elided, the object is the smart script:

Example

[bright52]% monitoring setup healthconf default
[bright52->monitoring->setup[default]->healthconf]% use smart
[bright52->...->healthconf[smart]]% set checkinterval prejob

The failactions value automatically switches to "enter: Drain node" when the value for the checkinterval parameter of the health check is set to prejob.
263. k such as a company or a university network its connection behavior is determined by the settings of two objects firstly the external network object settings of the Networks resource involved and secondly by the cluster object network settings for connecting to the outside In more detail 1 The external network object configuration specifies network set tings for nodes facing the outside such as login nodes or head nodes This means that network interface particulars associated with the external network for nodes on the external network are all set here These particulars are configured in the Settings tab of the Networks resource of cmgui Figure 4 4 for the following pa rameters e the IP address parameters base address netmask gateway DHCP ranges if using DHCP e the network domain LAN domain i e what domain ma chines on this network use as their domain Bright Computing Inc 64 Configuring The Cluster e network name what the external network itself is called e and MTU size the maximum value for a TCP IP packet before it fragments on this network the default value is 1500 2 The cluster object configuration sets the other network settings the cluster uses when connecting to the outside These particulars are configured in the Settings tab of the cluster object resource in cmgui Figure 4 6 e the nameservers used by the cluster to resolve external host names e the DNS search
264. kup infrastructure is already in place at the cluster site the following open source GPL software packages may be used to maintain regular backups e Bacula Bacula is a mature network based backup program that can be used to backup to a remote storage location If desired it is also possible to use Bacula on nodes to back up relevant data that is stored on the local hard drives More information is available at http www bacula org Bright Computing Inc 11 5 BIOS Configuration and Updates 211 e rsnapshot rsnapshot allows periodic incremental file system snap shots to be written to a local or remote file system Despite its sim plicity it can be a very effective tool to maintain frequent backups of a system More information is available at http www rsnapshot org 11 5 BIOS Configuration and Updates Bright Cluster Manager includes a number of tools that can be used to configure and update the BIOS of nodes All tools are located in the cm shared apps cmbios nodebios directory on the head node The re mainder of this section assumes that this directory is the current working directory Due to the nature of BIOS updates it is highly recommended that these tools are used with great care Incorrect use may render nodes un usable Updating a BIOS of a node requires booting it from the network us ing a specially prepared DOS image From the autoexec bat file one or multiple automated BIOS operations can be perfo
l, Operating System, Internal, Workload, Cluster, Prototype. These options should not be confused with the device category that the metric can be configured for (see the fourth bullet point from here, further on), which is a property of where the metrics are applied.

• Retrieval Method:
  - cmdaemon: metrics retrieved internally using CMDaemon (default)
  - snmp: metrics retrieved internally using SNMP

• State flapping count (default value: 7): how many times the metric value must cross a threshold within the last 12 samples (a default setting, set in cmd.conf) before it is decided that it is in a flapping state.

• Absolute range: the range of values that the metric takes. A range of 0-0 implies no constraint is imposed.

• Which device category the metric is configured for, with choices out of: Slave Node, Master Node, Power Distribution Unit, Myrinet Switch, Ethernet Switch, IB Switch, Rack Switch, Generic Switch, Chassis, GPU Unit. These options should not be confused with the class that the metric belongs to (see fourth bullet point from here, earlier on), which is the property type of the metric.

Figure 10.26: cmgui Monitoring: Main Metrics Tab, Add Collection Dialog

Metrics: The Main Tab's Add Collection Option
The Add Collection button opens a dialog which
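The "state flapping" criterion described above (a metric value crossing a threshold at least 7 times within the last 12 samples) can be sketched in Python. This is an illustrative reconstruction of the idea, not CMDaemon code; the function name and signature are invented for the example:

```python
def is_flapping(samples, threshold, max_samples=12, flap_count=7):
    """Return True if the metric crossed `threshold` at least `flap_count`
    times within the last `max_samples` samples (defaults mirror the
    values described in the text: 7 crossings within 12 samples)."""
    recent = samples[-max_samples:]
    crossings = 0
    for previous, current in zip(recent, recent[1:]):
        # A crossing happens when two consecutive samples lie on
        # opposite sides of the threshold.
        if (previous < threshold) != (current < threshold):
            crossings += 1
    return crossings >= flap_count

# A value oscillating around a threshold of 1.0 is flapping:
oscillating = [0.5, 1.5] * 6          # 11 crossings within 12 samples
print(is_flapping(oscillating, 1.0))  # True

steady = [0.5] * 12                   # never crosses the threshold
print(is_flapping(steady, 1.0))       # False
```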
l become the active Subnet Manager, whereas the other instances will remain in passive mode. It is recommended to run 2 Subnet Managers on all InfiniBand subnets to provide redundancy in case of failure.

When the head node in a cluster is equipped with an InfiniBand HCA, it is a good candidate to run as a Subnet Manager. The following command can be used to configure the Subnet Manager to be started at boot time on the head node:

chkconfig opensmd on

The following cmsh commands may be used to schedule the Subnet Manager to be started on one or more nodes:

[root@mc ~]# cmsh
[mc]% device services node001
[mc->device[node001]->services]% add opensmd
[mc->device[node001]->services[opensmd]]% set autostart yes
[mc->device[node001]->services[opensmd]]% set monitored yes
[mc->device[node001]->services[opensmd]]% commit
[mc->device[node001]->services[opensmd]]%

On large clusters it is recommended to use a dedicated node to run the Subnet Manager.

4.4.3 Network Settings
Although not strictly necessary, it is recommended that InfiniBand interfaces are assigned an IP address (i.e. IP over IB). First, a network object in the cluster management infrastructure should be created. The procedure for adding a network was described in section 4.2.2. The following settings are recommended as defaults:

Property    Value
Name
Figure 8.4: Workload Management Role Assignment Toggle States For An Individual Node

8.3.2 Enabling And Disabling Workload Managers From cmsh
In cmsh, assigning a workload manager role to a head node is done in device mode, using master as the device, and assigning the workload manager role from the roles submode.

Example:

[root@bright51 ~]# cmsh
[bright51]% device
[bright51->device]% use master
[bright51->device[bright51]]% roles
[bright51->device[bright51]->roles]% assign torqueserver
[bright51->device[bright51]->roles[torqueserver]]% commit
[bright51->device[bright51]->roles[torqueserver]]%

Workload manager role assignment of a node category is done using category mode, using the category name, and assigning a role from the roles submode.

Example:

[root@bright51 ~]# cmsh
[bright51]% category
[bright51->category]% use slave
[bright51->category[slave]]% roles
[bright51->category[slave]->roles]% assign torqueclient
[bright51->category[slave]->roles[torqueclient]]% commit
[bright51->category[slave]->roles[torqueclient]]%

For individual nodes, role assignment is done via device mode, using the node name, and assigning a role from
lay current session id
killsession ............... Kill a session
ps ........................ Provide overview of active sessions
[myheadnode->session]%

In the above example, session mode is entered, and help without any argument lists the possible commands at that level. To enter a mode, a user enters the mode name at the cmsh prompt. The prompt changes to indicate that cmsh is in the requested mode, and commands for that mode can then be run. To leave a mode, the exit command is used.

Example:

[mycluster]% device
[mycluster->device]% list
Type             Hostname    MAC                IP
---------------- ----------- ------------------ ---------------
EthernetSwitch   switch01    00:00:00:00:00:00  10.142.253.1
MasterNode       mycluster   00:E0:81:34:9B:48  10.142.255.254
PowerDistribut   apc01       00:00:00:00:00:00  10.142.254.1
SlaveNode        node001     00:E0:81:2E:F7:96  10.142.0.1
SlaveNode        node002     00:30:48:5D:8B:C6  10.142.0.2
[mycluster->device]% exit
[mycluster]%

A command can also be executed in a mode without entering that mode. This is done by specifying the mode before the command. Most commands also accept arguments after the command. Multiple commands can be executed in one line by separating commands with semi-colons.

A cmsh input line has the following syntax:

[<mode>] <cmd> [<arg> ... ]; [<mode>] <cmd> [<arg> ... ]

where modes and args are optional.

Example:

[mycluster->network]% device status node001; list
node001 ................. [ UP ]
Nam
lcompliance master package. Both recipes may need small modifications based on the cluster for which certification is required.

For the user certification run, two recipes are available:
• recipe-user-ib.xml
• recipe-user-nonib.xml

For the privileged user certification run, two recipes are available:
• recipe-root-ib.xml
• recipe-root-nonib.xml

When an InfiniBand interconnect is used in the cluster, the recipe-user-ib.xml and recipe-root-ib.xml recipes should be used. For clusters without InfiniBand interconnect, recipe-user-nonib.xml and recipe-root-nonib.xml should be used.

Throughout the recipe files, several performance thresholds can be defined which require tuning based on the hardware that is included in the cluster. When in doubt, it can be useful to configure values which are certainly too high (or too low, in the case of latency). After running the cluster checker, the performance thresholds can be adjusted to more realistic numbers, based on the results that were obtained in practice.

The cluster checker can be run with the auto option for automatic configuration, but this can give problems for clusters without InfiniBand interconnect. A description of all test modules and parameters is available in the Intel Cluster Checker documentation at http://software.intel.com/en-us/cluster-ready

Node List
The nodelist file lists the nodes which should be
le graphs are drawn in a single graph display pane by repeating the drag-and-drop for different metrics. For example, adding the CPUIdle metric with a drag-and-drop to the CPUUser graph of figure 10.7 gives a result as seen in figure 10.9, where both graphs lie on the same axis in the top pane.

Figure 10.9: Graph Display Pane: Multiple Graphs On One Pane

10.3.3 Zooming In With Mouse Gestures
Besides using a magnifying glass button, there are two other ways to zoom in on a graph, based on intuitive mouse gestures.

X-Axis Zoom
The first way to zoom in is to draw a horizontal line across the graph by holding the left mouse button down on the graph. A guide line shows up while doing this (figure 10.10).

Figure 10.10: Graph Display Pane: X-axis Zoom Start

The x-axis range covered by this line is zoomed in on when the mouse button is released (figure 10.11).

Figure 10.11: Graph Display Pane: X-axis Zoom Finish

Box Zoom
The second way to zoom i
ller           Entire file
/etc/ntp/step-tickers    Node-installer    Entire file
/etc/postfix/main.cf     Node-installer    Section
/etc/HOSTNAME            Node-installer    Entire file

Bright Computing Public Key

-----BEGIN PGP PUBLIC KEY BLOCK-----
Version: GnuPG v1.4.0 (GNU/Linux)

mQGiBEqt YegRBADStdQjniXxbYorXbFGncF2IcMFiNA7hamARt4w7hjtwZoKGHbC zSLsQTmgZ0 FZs tXcZa50L jGwhpxT6qhCe8Y7zIh2vwkrKlaAVK j 2PUU28vK j ip 2W 01iG HKLt ahLiCkOL3ahP0evJHh8B7e1C1rZOTKTBB6qIUbC5vHt jiwCgydm3 THLJsKnwk4qZetluTup1d0EEANCzJ1nZxZzN6ZAMkIBrct8GivWC1T1inBG4UwjHd EDcG1REJxpg OhpEP8TY 1 eOYUKRWvMqSVChPzkLUTIsd O04RGTwOPGCo6Q3TLXpM RVoonYPR1itRymPNZyW8V JeTUEnOkd1CaqZykp1sRb3 jFAiJIRCmBRc854i jRXmo foTPBACJQy oEH9Qfe3VcqR6 vR2t X91PvkxS7A5AnJIRs3Sv6yM40V 7k HrfYKt fyl 6 widtEbQ1870s4x3NYXmmne71z1nGxBfAxzPG9rt jRSXyVxc KGVd6gKeCV6d o7kS LJHRi0Lb5G4NZRF y5CGqg641i Jwp f2J4uyRbC8b LQbQ7 QnJpZ2hOIENv bXB1dGluZyBEZXZ1bG9wbWVudCBUZWFt IDxkZXZAYnJpZ2h0Y 29t cHVOaW5SnLmNv bI6IXgQTEQIAHgUCSq1h6AIbAWYLCQgHAWIDFQIDAxYCAQTeAQIXgAAKCRDvaS9m k3m0 JOOAKCOGLTZiqoCQ6TRWW2i j j ITEQ8CXACgg3040VbrG67VFzHUntcAOYTE DXW5AgOESqih6xAI AMJiaZI OEqnrhSfiMsMT3sxz3mZkrQQL82Fob7s S7nnM18 A8btPzL1K8NzZytCglrIwPCYG6vfza nkvyKEPh f2it941bh7qiu4rBLqr kGx3 zepSMRqI zW5FpIrUgDZOL9J tWSSUtPWOYQ5 jBBJrgJSLQy9dK2RhAOLUHf bOSVB JLIwNKxafkhMRwD oUNS4BiZKWyPFu47 vd8f M67 IPT1nM10iCOR QBn29MYuWnBew 61344pd I jOu3gM6YBqmRRU6yBeVi0TxxbY YnWcts6tEGALT jHUOQ7 gxVp4RDia2 jLVtbee8H464wxkkC3SSkng216RaBBAoaAykhzcAAwUH iG4WsJHFw3 CRhUqy51 jnmb1
lt. New profiles can also be created via the profile mode of cmsh, or via the Authorization resource of cmgui, thus making it possible to disable auditing for arbitrary groups of CMDaemon services.

PublicDNS
Syntax: PublicDNS = true|false
Default: PublicDNS = false

Setting the directive PublicDNS to true allows the head node to provide DNS services for any network, and not just the local one.

LockDownDhcpd directive
Syntax: LockDownDhcpd = true|false
Default: LockDownDhcpd = false

When set to true, DHCP's "deny unknown clients" option is set. This means no new DHCP leases are granted to unknown clients. In Bright 5.1 this flag is used for all networks; in 5.2, particular networks can be specified.

MaxNumberOfProvisioningThreads directive
Syntax: MaxNumberOfProvisioningThreads = number
Default: MaxNumberOfProvisioningThreads = 10000

The MaxNumberOfProvisioningThreads directive specifies the cluster-wide total number of nodes that can be provisioned simultaneously. Individual provisioning servers typically define a much lower bound on the number of nodes that may be provisioned simultaneously.

IpmiSessionTimeout directive
Syntax: IpmiSessionTimeout = number
Default: IpmiSessionTimeout = 2000

The IpmiSessionTimeout specifies the time-out for IPMI calls, in milliseconds.

SnmpSessionTimeout directive
Syntax: SnmpSessionTimeout = number
Default: SnmpSessionTimeout = 500000
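Taken together, a CMDaemon configuration fragment using these directives might look as follows. This is a sketch only: the `name = value` form follows the Syntax lines above, and the values shown are simply the documented defaults (any directive not present keeps its default):

```
PublicDNS = false
LockDownDhcpd = false
MaxNumberOfProvisioningThreads = 10000
IpmiSessionTimeout = 2000
SnmpSessionTimeout = 500000
```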
      <xs:complexType>
        <xs:simpleContent>
          <xs:extension base="xs:string">
            <xs:attribute name="name" use="required">
              <xs:simpleType>
                <xs:restriction base="xs:string">
                  <xs:pattern value="[a-zA-Z0-9_]+"/>
                </xs:restriction>
              </xs:simpleType>
            </xs:attribute>
            <xs:attribute name="args" type="xs:string"/>
          </xs:extension>
        </xs:simpleContent>
      </xs:complexType>
    </xs:element>
    <xs:element name="partition" type="partition" minOccurs="1" maxOccurs="unbounded"/>
  </xs:sequence>
</xs:complexType>

<xs:complexType name="partition">
  <xs:sequence>
    <xs:element name="size" type="size"/>
    <xs:element name="type">
      <xs:simpleType>
        <xs:restriction base="xs:string">
          <xs:enumeration value="linux"/>
          <xs:enumeration value="linux swap"/>
          <xs:enumeration value="linux raid"/>
          <xs:enumeration value="linux lvm"/>
        </xs:restriction>
      </xs:simpleType>
    </xs:element>
    <xs:group ref="filesystem" minOccurs="0" maxOccurs="1"/>
  </xs:sequence>
  <xs:attribute name="id" type="xs:string" use="required"/>
</xs:complexType>

<xs:group name="filesystem">
  <xs:sequence>
    <xs:element name="filesystem">
      <
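For orientation, a partition element conforming to the schema fragment above would carry a required id attribute, a size, a type taken from the enumeration, and optionally a filesystem group. The following is an illustrative sketch only; the id value and the size notation are assumptions, not taken from the schema itself:

```xml
<partition id="a2">
  <size>10G</size>
  <type>linux</type>
</partition>
```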
lue of the object
append ........... Append a value to a specific property of the object, for a property that can take more than one value at a time
removefrom ....... Remove a value from a specific property of the object, for a property that can take more than one value at a time
modified ......... List objects with uncommitted local changes
usedby ........... List objects that depend on the object
validate ......... Do a validation check on the properties of the object

Working with objects with these commands is demonstrated with several examples in this section.

Working With Objects: use
Example:

[mycluster]% device use node001
[mycluster->device[node001]]% status
node001 ............. [ UP ]
[mycluster->device[node001]]% exit
[mycluster->device]%

In the above example, use node001 issued from within device mode makes node001 the current object. The prompt changes accordingly. The status command without an argument then returns status information just for node001, because making an object the current object makes all subsequent commands apply only to that object. Finally, the exit command unsets the current object.

Working With Objects: add, commit
Example:

[mycluster->device]% add slavenode node100 10.141.0.100
[mycluster->device[node100]]% category add test-slave
[mycluster->category[test-slave]]% device use node100
[mycluster->device[node100]]% set category test-sla
m --installroot=/cm/images/default-image update

With the chroot command, the same result is accomplished by first chrooting into an image, and subsequently executing yum or rpm commands without --root or --installroot arguments. The chroot command may also be used to install software which is not supplied as an RPM package into a node image. For example:

cd /cm/images/default-image/usr/src
tar xvzf /tmp/app-4.5.6.tar.gz
chroot /cm/images/default-image
cd /usr/src/app-4.5.6
./configure --prefix=/usr
make install

While chroot can be a useful tool for installing software into a node image, it can have issues if it starts up daemons in the image. For example, installation scripts that stop and re-start a system service during a source install may successfully start that service within the image's chroot jail, and thereby cause related unexpected changes in the image. Pre- and post-(un)install scriptlets that are part of RPM packages may cause similar problems. Bright RPM packages are, however, designed to install under chroot without issues.

9.4 Kernel Updates
In general, it is a good idea to be careful about updating the kernel on a head node or in a node image. This is particularly true when custom kernel modules are being used that were compiled against a particular kernel. To prevent an automatic update of a package, it is listed on the yum command line using the --exclude
m
Slave digits    3
Slave name      node
Time servers    pool.ntp.org
Time zone       America/Los_Angeles

3.6.5 Advanced cmsh Features
This section describes some advanced features of cmsh, and may be skipped on first reading.

Command Line Editing
Command line editing and history features from the readline library are available. See http://tiswww.case.edu/php/chet/readline/rluserman.html for a full list of key bindings. The most useful features provided by readline are tab-completion of commands and arguments, and command history using the arrow keys.

Mixing cmsh And Unix Shell Commands
Occasionally it can be useful to be able to execute unix commands while performing cluster management. For this reason, cmsh allows users to execute unix commands by prefixing the command with a "!" character.

Example:

[mycluster]% !hostname -f
mycluster.cm.cluster
[mycluster]%

Executing the "!" command by itself will start an interactive login sub-shell. By exiting the sub-shell, the user will return to the cmsh prompt.

Besides simply executing commands from within cmsh, the output of unix shell commands can also be used within cmsh. This is done by using the backtick syntax available in most unix shells.

Example:

[mycluster]% device use `hostname`
[mycluster->device[mycluster]]% status
mycluster ............... [ UP ]
[mycluster->device[mycluster]]%

Output Redirection
Similar to unix shells, cmsh also supports output redirection to t
mal hardware requirements:

Head Node
• Intel Xeon or AMD Opteron CPU (64-bit)
• 2GB RAM
• 80GB diskspace
• 2 Gigabit Ethernet NICs
• DVD drive

Compute Nodes
• Intel Xeon or AMD Opteron CPU (64-bit)
• 1GB RAM (at least 4GB recommended for diskless)
• 1 Gigabit Ethernet NIC

2.2 Supported Hardware
The following hardware is supported:

Compute Nodes
• SuperMicro
• Cray
• Dell
• IBM
• Asus
• Hewlett Packard

Other brands are unsupported, but are also expected to work.

Ethernet Switches
• HP Procurve
• Nortel
• Cisco
• Dell
• SuperMicro
• Netgear

Other brands are unsupported, but are also expected to work.

Power Distribution Units
• APC (American Power Conversion) Switched Rack PDU

Other brands are unsupported, but are also expected to work.

Management Controllers
• IPMI 1.5/2.0
• HP iLO 1/2

InfiniBand
• Most InfiniBand HCAs

2.3 Head Node Installation
This section describes the steps in installing a Bright Cluster Manager head node. To start the install, the head node is booted from the Bright Cluster Manager DVD.

Welcome Screen
The welcome screen (figure 2.1) displays version and license information. Two installation modes are available: normal mode and express mode. Selecting the express mode installs the head node with the pre-defined configuration that the DVD was created with. The administrator password automaticall
ment using version 3.2 of the Pathscale compiler and version 1.2.7 of MPICH for Gigabit Ethernet, assuming these are already installed on the system. An MPI application can then be compiled with this environment:

module add shared
module add pathscale/3.2
module add mpich/ge/psc/64/1.2.7
mpicc -o myapp myapp.c

Note that specifying version numbers explicitly is typically only necessary when multiple versions of an application have been installed. When there is no ambiguity, module names without a further path specification may be used.

3.2.2 Using Local And Shared Modules
Applications and their associated modules are divided into local and shared groups. Local applications are installed on the local file system, whereas shared applications reside on a shared (i.e. imported) file system.

The shared module is loaded by default for ordinary users. Loading it gives access to the modules belonging to shared applications, and allows the module avail command to show these extra modules.

Loading the shared module automatically for root is not recommended on a cluster where shared storage is not on the head node itself, because root logins could be obstructed if this storage is unavailable. The shared module is therefore not loaded by default for root. On clusters without external shared storage, root can safely load the shared module automatically at login. This can be done by runnin
meservers    name of timeservers    set with:
cmsh -c "partition; set base timeservers hostname; commit"

If address is set to 0.0.0.0, then the value offered by the DHCP server on the external network is accepted. Space-separated multiple values are also accepted for these parameters when setting the value for address or hostname.

J.3 Terminology
A reminder about the less well-known terminology in the table:

• netmaskbits is the netmask size, or prefix length, in bits. In IPv4's 32-bit addressing, this can be up to 31 bits, so it is a number between 1 and 31. For example, networks with 256 (2⁸) addresses (i.e. with host addresses specified with the last 8 bits) have a netmask size of 24 bits. They are written in CIDR notation with a trailing "/24", and are commonly spoken of as "slash 24" networks.

• baseaddress is the IP address of the network the head node is on, rather than the IP address of the head node itself. The baseaddress is specified by taking netmaskbits number of bits from the IP address of the head node. Examples:

  - A network with 256 (2⁸) host addresses: this implies the first 24 bits of the head node's IP address are the network address, and the remaining 8 bits are zeroed. This is specified by using 0 as the last value in the dotted-quad notation, i.e. zeroing the last 8 bits. For example: 192.168.3.0

  - A network with 128 (2⁷) host addresses: H
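The baseaddress derivation above (keep the first netmaskbits bits of the head node's IP address, zero the rest) can be checked with a short Python illustration. This is not part of Bright Cluster Manager, and the head node address 192.168.3.130 is chosen purely for the example:

```python
import ipaddress

# A head node at 192.168.3.130 on a "slash 24" network (24 netmask bits,
# 256 = 2**8 host addresses): zeroing the last 8 bits gives the base address.
network = ipaddress.ip_network("192.168.3.130/24", strict=False)
print(network.network_address)  # 192.168.3.0, as in the /24 example above
print(network.num_addresses)    # 256

# The same head node on a network with 128 = 2**7 addresses (25 netmask bits):
smaller = ipaddress.ip_network("192.168.3.130/25", strict=False)
print(smaller.network_address)  # 192.168.3.128
```

Note the use of strict=False, which tells ip_network to zero the host bits rather than reject an address with host bits set.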
mi0 interface. Click Continue to continue.

14. If an InfiniBand network was enabled, select which nodes (if any) are to run the subnet manager for the InfiniBand network. Click Continue to continue.

15. Select the DVD drive containing the Bright Cluster Manager DVD and click Continue.

16. Select a workload management system and set the number of slots per node equal to the number of CPU cores per node. Click Continue to continue.

17. Optionally, you may modify the disk layout for the head node by selecting a pre-defined layout. The layout may be fine-tuned by editing the XML partitioning definition. Click Continue to continue.

18. Select a time zone, and optionally add NTP time servers. Click Continue to continue.

19. Enter a hostname for the head node. Enter a password that will be used for system administration twice, and click Continue.

20. Configure text or graphical consoles for the nodes in the cluster.

21. Review the network summary screen and click the Start button to start the installation.

22. Wait until installation has completed, and click Reboot.

F.2 First Boot
1. Ensure that the head node boots from the first harddrive by removing the DVD or altering the boot order in the BIOS configuration.

2. Once the machine is fully booted, log in as root with the password that was entered during installation.

3. Confirm that the machine is visible on the extern
mmand
Drain node      <built-in>
Power off       <built-in>
Power on        <built-in>
Power reset     <built-in>
Reboot          <built-in>
SendEmail       <built-in>
Shutdown        <built-in>
Undrain node    <built-in>
killprocess     /cm/local/apps/cmd/scripts/actions/killprocess
testaction      /cm/local/apps/cmd/scripts/actions/testaction

The above shows the actions available on a newly installed system. The details of what they do are covered in appendix H.3.1.

The show command of cmsh displays the parameters and values of a specified action.

Example:

[myheadnode->monitoring->actions]% show poweroff
Parameter       Value
--------------- ------------------------
Command         <built-in>
Description     Power off the device
Name            Power off
Run on          master
Timeout         5
isCustom        no
[myheadnode->monitoring->actions]%

The meanings of the parameters are covered in appendix H.3.2.

Tab-completion suggestions with the show command suggest arguments corresponding to names of action objects.

Example:

[myheadnode->monitoring->actions]% show

A double tap on the tab key to get tab-completion suggestions for show in the above will display the following:

drainnode    killprocess  poweron     reboot    shutdown   undrainnode
killallyes   poweroff     powerreset  sendemail testaction

The Power off action name, for example, corresponds with the argument poweroff. By default, the arguments are the action names in l
3.3.3 Profiles
Certificates that authenticate to the cluster management infrastructure contain a profile. A profile determines which cluster management operations the certificate holder may perform. The administrator certificate is created with the admin profile, which is a built-in profile that allows all cluster management operations to be performed. In this sense it is similar to the root account on unix systems. Other certificates may be created with different profiles, giving certificate owners access to a pre-defined subset of the cluster management functionality (section 7.5).

3.4 Cluster Management GUI
This section introduces the basics of the cluster management GUI (cmgui). This is the graphical interface to cluster management in Bright Cluster Manager. It may be run on the head node or on a login node of the cluster, using X11 forwarding:

Example:

user@desktop:~> ssh -X root@mycluster cmgui

However, more typically it is installed and run on the administrator's desktop computer. This saves user-discernable lag time if the user is hundreds of kilometers away from the head node.

3.4.1 Installing Cluster Management GUI
To install cmgui on a desktop computer running Linux or Windows, the installation package must be downloaded first. These are available on any Bright Cluster Manager cluster, in the directory /cm/shared/apps/cmgui/dist. Installation package
multiple interfaces, one interface may be faster than the others. If so, it can be convenient to receive the image data via the fastest interface. Setting the value of provisioninginterface, which is a property of the node configuration, allows this. By default it is set to BOOTIF.

transport protocol used for image data: provisioningtransport
The provisioning system can send the image data encrypted or unencrypted. The provisioningtransport property of the node configuration can have these values:

• rsyncdaemon, which sends the data unencrypted
• rsyncssh, which sends the data encrypted

Because encryption severely increases the load on the provisioning node, using rsyncssh is only suggested if the users on the network cannot be trusted. By default, provisioningtransport is set to rsyncdaemon.

tracking the status of image data provisioning: provisioningstatus
The provisioningstatus command within the softwareimage mode of cmsh displays an updated state of the provisioning system. As a one-liner, it can be run as:

bright51:~# cmsh -c "softwareimage provisioningstatus"
Provisioning subsystem status: idle, accepting requests
Update of provisioning nodes requested: no
Maximum number of nodes provisioning: 10000
Nodes currently provisioning: 0
Nodes waiting to be provisioned: <none>
Provisioning node bright51:
  Max number of provisioning nodes: 10
  Nodes provisioning: 0
  Nod
must be run separately in order to complete installation of the license. The install-license script takes the temporary location of the new license file that was generated by request-license as its argument, and installs related files on the head node. Running it completes the license installation on the head node.

Example: assuming the new certificate is saved as cert.pem.new:

[root@bright51 ~]# install-license /cm/local/apps/cmd/etc/cert.pem.new
Certificate Information
Version:             5.1
Edition:             Advanced
Common name:         My Cluster
Organization:        Bright Computing, Inc.
Organizational unit: Development
Locality:            San Jose
State:               California
Country:             US
Serial:              3066
Starting date:       01 Jan 2000
Expiration date:     31 Dec 2038
MAC address:         00:0C:29:87:B8:B3
Licensed nodes:      2048

Is the license information correct? [Y/n] y

In order to authenticate to the cluster using the Cluster Management GUI (cmgui), one must hold a valid certificate and a corresponding key. The certificate and key are stored together in a password-protected PFX (a.k.a. PKCS#12) file. Please provide a password that will be used to password-protect the PFX file holding the administrator certificate (/root/.cm/cmgui/admin.pfx).

Password:
Verify password:
Installed new license
Waiting for CMDaemon to stop: OK
Installing admin certificates
Waiting for CMDaemon to start: OK
New licen
n a setup with DRBD, both head nodes mirror a physical block device on each node over a network interface. This results in a virtual shared DRBD block device. A DRBD block device is effectively a simulated DAS block device. DRBD is a cost-effective solution for implementing shared storage in an HA setup.

Custom Shared Storage
The cluster management daemon on the two head nodes deals with shared storage through a mount script and an unmount script. When a head node is moving to active mode, it needs to acquire the shared filesystems. To accomplish this, the other head node first needs to relinquish any shared filesystems that may still be mounted. After this has been done, the head node that is moving to active mode invokes the mount script, which has been configured during the HA setup procedure. When an active head node is requested to become passive (e.g. because the administrator wants to take it down for maintenance without disrupting jobs), the unmount script is invoked to release all shared filesystems. By customizing the mount and unmount scripts, an administrator has full control over the form of shared storage that is used. Also, an administrator can control which filesystems are shared.

13.1.5 Handling A Split Brain
Because of the risks involved in accessing a shared filesystem simultaneously from two head nodes, it is of the highest importance that only one head
n aspects that are important to get all the hardware up and running. More elaborate aspects of cluster configuration, such as power management and workload management, will be covered in later chapters.

4.1 Installing A License
Any Bright Cluster Manager installation requires a license file to be present on the head node. The license file specifies the conditions under which a particular Bright Cluster Manager installation has been licensed. For example: the name of the organization is an attribute of the license file that specifies the condition that only the specified organization may use the software. Another example: the maximum number of nodes is an attribute in the license file that specifies the condition that no more than the specified number of nodes may be used by the software.

A license file can only be used on the machine for which it has been generated, and cannot be changed once it has been issued. This means that to change licensing conditions, a new license file must be issued.

The license file is sometimes referred to as the cluster certificate, because it is the X509v3 certificate of the head node, and is used throughout cluster operations. Section 3.3 has more information on certificate-based authentication.

4.1.1 Displaying License Attributes
Before starting the configuration of a cluster, it is important to verify that the attributes included in the license file have been assigned the correct values. The license file i
n be chosen by selecting it from the drop-down boxes. This will then be used for installation. Also, for each option, a text editor box opens up when an option's edit button is clicked (figure 2.19), and is useful for viewing and changing values. The Save and Reset buttons are enabled on editing, and will save or undo the text editor changes that were made. Once saved, the changes cannot be reverted. Clicking Continue on this screen leads to the Time Configuration screen, described next.

Figure 2.18: Disk Partitioning and Layouts

Figure 2.19: Edit Head Node Disk Partitioning

Time Configuration
The Time Configuration screen (figure 2.20) displays a predefined list of timeservers. Timeservers can be removed by selecting a timeserver from the list and clicking the remove button. Additional timeservers can be added by entering the name of the timeserver an
n is to draw a box instead of a line across the graph, by holding the left mouse button down and drawing a line diagonally across the data instead of horizontally. A guide box shows up (figure 10.12).

Figure 10.12: Graph Display Pane: Box Zoom Start

This is zoomed into when the mouse button is released (figure 10.13).

Figure 10.13: Graph Display Pane: Box Zoom Finish

10.3.4 The Graph Display Settings Dialog
Clicking on the settings button in the graph display pane (figure 10.8) opens up the graph display pane settings dialog (figure 10.14).

Figure 10.14: Graph Display Pane Settings Dialog

This allows the following settings to be modified:

• the Title shown at the top of the graph
• over When: the x-range that is displayed
• the Intervals value: this is the number of intervals by d
n the location given by the parameter, it can be created as suggested in the basic example of section 10.1, in "Setting Up The Kill Action".

196 Cluster Monitoring

10.7.2 Cmsh Monitoring Healthchecks
The monitoring healthchecks mode of cmsh corresponds to the cmgui Health Checks tab of section 10.4.5.
The monitoring healthchecks mode handles health check objects in the way described in the introduction to working with objects (section 3.6.3). A typical reason to handle health check objects (the properties associated with a health check script or health check built-in) might be to view the health checks already available, or to add a health check for use by a device resource.
This section goes through a cmsh session, giving some examples of how this mode is used, and to illustrate what it looks like.

cmsh monitoring healthchecks: list, show, and get
In monitoring healthchecks mode, the list command by default lists the names of the health check objects along with their command scripts.

Example

[myheadnode]% monitoring healthchecks
[myheadnode->monitoring->healthchecks]% format name:18, command:55
[myheadnode->monitoring->healthchecks]% list
name (key)         command
------------------ ----------------------------------------------------
DeviceIsUp         <built-in>
ManagedServicesOk  <built-in>
cmsh               /cm/local/apps/cmd/scripts/healthchecks/cmsh
exports            /cm/local/apps/cmd/scripts/healthchecks/exports

The format command, introduced in section 3.6.3, is used here with the given column widths
nd set up of workload managers can also be done after the Bright Cluster Manager 5.1 installation.

136 Workload Management

Some workload manager packages are installed by default, others require registration from the distributor before installation. During installation and set up of a workload manager package, the first time the workload manager is run is when its databases must be initialized.
The installation and initialization procedure is described in the installation section for each workload manager in this chapter.

8.2 Forcing Jobs To Run In A Workload Management System
Another preliminary step is to consider forcing users to run jobs only within the workload management system. Having jobs run via a workload manager is normally a best practice.
For convenience, a Bright Cluster defaults to allowing users to log in to a node and run their processes outside the workload management system without restriction. For clusters with a significant load, this policy results in a sub-optimal use of resources, since such unplanned-for jobs disturb any already running jobs.
Disallowing user logins to nodes, so that users have to run their jobs through the workload management system, means that jobs on the nodes are then disturbed only according to the planning of the workload manager. If planning is based on sensible assignment criteria, then resource use is optimized, which is the entire aim of a workload mana
ngs 59

After the license is installed, verifying the license attribute values is a good idea. This can be done using the licenseinfo command in cmsh, or the License tab in cmgui's cluster resource tabbed pane (section 4.1.1).

4.2 Network Settings
After the cluster is set up with the correct license, the next configuration step is to define the networks that are present. During the Bright Cluster Manager installation, at least two default networks were created:

• internalnet: the primary internal cluster network, or management network. This is used for booting non-head nodes and for all cluster management communication. In the absence of other internal networks, internalnet is also used for storage and communication between compute jobs.

• externalnet: the network connecting the cluster to the outside world (typically a corporate or campus network).

4.2.1 Configuring Networks
The network mode in cmsh gives access to all network-related operations, using the standard object commands (see section 3.6.3 for more on cmsh modes and working with objects).
In cmgui, a network can be configured by selecting the Networks item in the resource tree (Figure 4.2).

Figure 4.2: Networks (the overview lists each network with its Name, Base address, External, Node booting, Dynamic range start and Dynamic range end columns; externalnet has base address 0.0.0.0, and internalnet 1
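The relation between the netmask bits and the size of a network such as internalnet can be checked with plain shell arithmetic. The snippet below is an illustrative sketch only, not part of Bright Cluster Manager; the values match the default internalnet settings shown in Figure 4.4 (base address 10.141.0.0/16, dynamic range 10.141.128.0 to 10.141.143.255):

```shell
# Sketch: CIDR arithmetic for the default internalnet (10.141.0.0/16).
prefix=16
# A /16 network contains 2^(32-16) addresses.
total=$(( 1 << (32 - prefix) ))
echo "total addresses: $total"          # 65536
# The dynamic range 10.141.128.0 - 10.141.143.255 spans
# (143 - 128 + 1) third-octet values of 256 addresses each.
range=$(( (143 - 128 + 1) * 256 ))
echo "dynamic range size: $range"       # 4096
```

So the default dynamic range used for node booting reserves 4096 of the 65536 addresses in the management network.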
ngs state can be done from options in the File menu.

Figure 10.6: cmgui Monitoring Window: Resources View

Figure 10.6 shows the different resources of the head node, with the CPU resource subtree opened up in turn to show its metrics and health checks. Out of these, the CPUUser metric, for user CPU usage, is shown selected for further display. To display this metric, the selection is drag-and-dropped onto one of the 3 panes which has the text "drop sensor here".

10.3.2 The Graph Display Pane
Figure 10.7 shows the monitoring window after such a drag-and-drop. The graph of the metric CPUUser is displayed over 20 minutes (10th November 2010, 08:04 to 08:24). On the y-axis the unit used by the metric is shown (0 to about 100). This example is actually of data gathered when the basic example of 10.1 was run, and shows CPUUser rising as a number of

10.3 Monitoring Visualization With Cmgui 173

yes processes are run, and falling when they end.

Figure 10.7: cmgui Monitoring Window: Graph Display Pane

Features of graph display panes are (Figure 10.8):

1. The close widget, which erases all graphs on the drawing pane when it is clicked. Individual graphs are removed in the settings dialog discussed in section 10.3.4.

2. The
nistrators intending to manage a cluster with only cmgui may therefore safely skip this section.
Usually cmsh is invoked from an interactive session (e.g. through ssh) on the head node, but it can also be used to manage the cluster from outside.

3.6.1 Invoking cmsh
From the head node, cmsh can be invoked as follows:

[root@mycluster ~]# cmsh
[mycluster]%

Running cmsh without arguments starts an interactive cluster management session. To go back to the unix shell, a user enters quit:

[mycluster]% quit
[root@mycluster ~]#

The -c flag allows cmsh to be used in batch mode. Commands may be separated using semi-colons:

[root@mycluster ~]# cmsh -c "main showprofile; device status apc01"
admin
apc01 ............... [ UP ]
[root@mycluster ~]#

Alternatively, commands can be piped to cmsh:

[root@mycluster ~]# echo device status | cmsh
apc01 ............... [ UP ]
mycluster ........... [ UP ]
node001 ............. [ UP ]
node002 ............. [ UP ]
switch01 ............ [ UP ]
[root@mycluster ~]#

In a similar way to unix shells, cmsh sources ~/.cm/cmsh/.cmshrc upon start-up, in both batch and interactive mode. This is convenient for defining command aliases, which may subsequently be used to abbreviate longer commands. For example, putting the following in .cmshrc allows the ds command to be used as an alias for device status:

Example

alias ds device status

The options/usage information for cmsh is obtainable with cmsh -h (Fig
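Building on the alias example above, a ~/.cm/cmsh/.cmshrc file can hold several such definitions. The alias names in the fragment below are arbitrary illustrations, not predefined by cmsh; only the alias syntax itself is taken from the manual:

```
# ~/.cm/cmsh/.cmshrc -- sourced by cmsh at start-up,
# in both batch and interactive mode.
# Hypothetical convenience aliases:
alias ds device status    # 'ds' runs 'device status'
alias dl device list      # 'dl' runs 'device list'
```

After this, running ds inside cmsh (or piping ds to cmsh) behaves exactly as if device status had been typed.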
nnot be altered via Bright Cluster Manager.
• The Add button allows a new job queue to be added to a workload manager.
• The Remove button removes a job queue from the workload manager.
• The Revert button reverts the Queues tabbed pane to its last saved state.
• The Save button saves the modified Queues tabbed pane.

8.5.3 Nodes Display And Handling In cmgui
Selecting the Nodes tab displays a list of nodes, along with their schedulers, queues, and whether they are in a status of Drained or Undrained (Figure 8.9).

Figure 8.9: Node Drainage

• The Drain button sets the state of a node, scheduler and queue combination to Drained. The workload manager then stops jobs from starting to run for that combination.
• The Undrain button unsets a Drained state, allowing jobs to start running for that combination.
• The Refresh button refreshes the screen so that the latest available state is displayed.

8.6 Using cmsh With Workload Management 147

8.6.1 Jobs Display And Handling In cmsh: jobs Mode
jobs Mode In cmsh: Top Level
At the top level of jobs mode, the administrator can view all jobs regardless
node is in active mode at any point in time. To guarantee that a head node that is about to switch to active mode will be the only head node in active mode, it must either receive confirmation from the other head node that it is in passive mode, or it must make sure that the other head node is powered off.
When the passive head node determines that the active head node is no longer reachable, it must also take into consideration that there could be a communication disruption between the two head nodes. This is generally referred to as a split brain situation.
Since detecting a split brain situation is impossible, the passive head node may not assume that the active node is no longer up if it finds the active node to be unresponsive. It is quite possible that the active head node is still up and running, and observes that the passive head node has disappeared (i.e. a split brain).
To resolve these situations, a passive head node that notices that its active counterpart is no longer responding will first go into fencing mode. While a node is fencing, it will try to obtain proof that its counterpart is indeed powered off.
There are two ways in which such proof can be obtained:

a. By asking the administrator to manually confirm that the active head node is indeed powered off.

b. By performing a power off operation on the active head node, and then checking that the power is indeed off. This is also referred to as a STONITH (Shoot The Other Node In The Head)
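The fencing logic described above can be summarized as a small decision sketch. This is purely conceptual shell, not Bright Cluster Manager code; the function name and the state labels are invented for illustration:

```shell
# Conceptual sketch of the passive head node's failover decision.
# Arguments: whether the active head still responds, and whether proof
# of power-off has been obtained (manual confirmation or STONITH).
decide() {
  local active_responds=$1 power_off_proven=$2
  if [ "$active_responds" = yes ]; then
    echo stay-passive      # active head is fine; do nothing
  elif [ "$power_off_proven" = yes ]; then
    echo become-active     # safe: no split brain is possible
  else
    echo fencing           # keep trying to obtain proof of power-off
  fi
}
decide no no     # prints: fencing
decide no yes    # prints: become-active
```

The key point the sketch captures is that unresponsiveness alone is never sufficient to take over; only proof of power-off moves the passive node out of fencing mode.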
node is powered off when it is no longer responding, and a failover sequence is initiated automatically.
In case of manual failover, the administrator is responsible for initiating the failover when the active head node is no longer responding. No automatic power off is done, so the administrator will be asked to certify that the previously active node is powered off.
For automatic failover to be possible, power control should be defined for both head nodes. If power control has been defined for the head nodes, automatic failover is used by default. However, it is possible to disable automatic failover. In cmsh this is done by setting the disableautomaticfailover property:

Example

[root@bright51 ~]# cmsh
[bright51]% partition failover base
[bright51->partition[base]->failover]% set disableautomaticfailover yes
[bright51->partition*[base*]->failover*]% commit

With cmgui it is done by selecting the cluster resource, then selecting the Failover tab. Within the tab, the Disable automatic failover checkbox is ticked, and the change saved with a click on the Save button.
If no power control has been defined, or if automatic failover has been disabled, a failover sequence must always be initiated manually by the administrator.

13.2 HA Set Up Procedure
After a cluster has been installed using the procedure described in chapter 2, the administrator has the choice of running the cluster with a single head node
nodes; commit

[root@bright51 ~]# for i in {129..255}
> do
> cmsh -c "device; set node$i category newnodes; commit"
> done
Successfully committed 1 Devices
Successfully committed 1 Devices
...

2. The hardware profile of one of the new nodes, say node129, is saved into the category newnodes. This is done using the node-hardware-profile health check (see Appendix H.2.1) as follows:

[root@bright51 ~]# /cm/local/apps/cmd/scripts/healthchecks/node-hardware-profile -n node129 -s newnodes

The profile is intended to be the reference hardware against which all the other nodes should match.

3. The frequency with which the health check should run in normal automated periodic use is set as follows (some prompt text elided):

[root@bright51 ~]# cmsh
[bright51]% monitoring setup healthconf newnodes
[...->healthconf]% add hardware-profile
[...->healthconf*[hardware-profile*]]% set checkinterval 600; commit

4. The cmdaemon then automatically alerts the administrator if one of the nodes does not match the hardware of that category during the first automated check. In the unlikely case that the reference node is itself faulty, then that will also be obvious, because all (or almost all, if more nodes are faulty) of the other nodes in that category will then be reported faulty during the first check.

12 Third Party Software
In this chapter, several third party software packages included in
nside a real cluster. The cluster partition behaves as a separate cluster while making use of the resources of the real cluster in which it is contained. Although cluster partitioning is not yet possible in the current version of Bright Cluster Manager, its design implications do decide how some global cluster properties are accessed through cmsh.
In cmsh there is a partition mode which will, in a future version, allow an administrator to create and configure cluster partitions. Currently, there is only one fixed partition, called base. The base partition represents the physical cluster as a whole and cannot be removed. A number of properties global to the cluster exist inside the base partition. These properties are referenced and explained in remaining parts of this manual.

Example

[root@myheadnode ~]# cmsh
[myheadnode]% partition use base
[myheadnode->partition[base]]% show
Parameter                      Value
------------------------------ ------------------------------
Administrator e-mail
Burn configs                   <2 in submode>
Cluster name                   My Cluster
Default burn configuration     default
Default category               slave
Default software image         default-image
External network               externalnet
Failover                       not defined
IPMI Password                  *********
IPMI User ID                   2
IPMI User name                 ADMIN
Management network             internalnet
Masternode                     myheadnode
Name                           base
Name servers                   192.168.101.1

48 Cluster Management with Bright Cluster Manager

Rack setup                     1 racks of 42 high
Search domains                 clustervision.co
nstallation (the Installation Progress screen shows progress bars, the "Automatically reboot after installation is complete" option ticked, and a Cancel button)

Figure 2.24: Installation Progress

2.3 Head Node Installation 23

Figure 2.25: Installation Completed

After rebooting, the system starts and presents a login prompt. After logging in as root using the password that was set during the installation procedure, the system is ready to be configured. If express installation mode was chosen earlier as the install method, then the password is preset to system.
Next, in Chapter 3, some of the tools and concepts that play a central role in Bright Cluster Manager are introduced. Chapter 4 then explains how to configure and further set up the cluster.

Cluster Management with Bright Cluster Manager
This chapter introduces cluster management with Bright Cluster Manager. A cluster running Bright Cluster Manager exports a cluster management interface to the outside world, which can be used by any application designed to communicate with the cluster.
Section 3.1 introduces a number of concepts which are key to cluster management using Bright Cluster Manager.
Section 3.2 gives a short introduction on how the modules environment can be used by adminis
nt with Bright Cluster Manager

To avoid having to remember the disparate ways in which to change these 5 passwords, the cm-change-passwd command runs a dialog prompting the administrator on which of them, if any, should be changed, as in the following example:

[root@bright52 ~]# cm-change-passwd
With this utility you can easily change the following passwords:
 * root password of head node
 * root password of slave images
 * root password of node-installer
 * root password of mysql administrator
 * certificate for use with cmgui (/root/admin.pfx)

Note: if this cluster has a high-availability setup with 2 head nodes,
be sure to run this script on both head nodes.

Change password for root on head node? [y/N]: y
Changing password for root on head node.
Changing password for user root.
New UNIX password:
Retype new UNIX password:
passwd: all authentication tokens updated successfully.

Change password for root in default image? [y/N]: y
Changing password for root in default image.
Changing password for user root.
New UNIX password:
Retype new UNIX password:
passwd: all authentication tokens updated successfully.

Change password for root in node-installer? [y/N]: y
Changing password for root in node-installer.
Changing password for user root.
New UNIX password:
Retype new UNIX password:
passwd: all authentication tokens updated successfully.

Change password for MYSQL root user? [y/N]: y
Changing password for MYSQL root user.
Old password:
ntel-cluster-runtime package is installed on both the head node and the software images. These packages guarantee, through package dependencies, that all Intel Cluster Ready package requirements are satisfied. Both packages are normally installed by default on a standard Bright Cluster Manager cluster. If they are not installed, then the following commands install the complete suite:

Example

yum install cm-config-intelcompliance-master intel-cluster-runtime
yum --installroot=/cm/images/default-image install cm-config-intelcompliance-slave intel-cluster-runtime

If yum reports that any additional packages need to be installed, simply agreeing to install them is enough to satisfy the requirements.
The Intel Cluster Ready specification also requires the /etc/dat.conf file. The file can be copied from the /etc/ofed directory, and has to be changed. This has to be done for both the head node and the software images. The lines in the file that mention devices that are not used can be removed.

Example

For the head node:
cp /etc/ofed/dat.conf /etc
For the default image:
cp /cm/images/default-image/etc/ofed/dat.conf /cm/images/default-image/etc

The ibstat command can be used to check if an InfiniBand device is used, and if so, what kind. If the mlx4_0 is used, the following lines are needed in the dat.conf file:

ofa-v2-mlx4_0-1u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mlx4_0 1" ""
ofa-v2-mlx4_0-2u u2.0 nonthreadsafe
ntification at the console. The node will now be provisioned and will eventually boot. In case of problems, consult section 6.7.
6. (Optional) To configure power management, consult chapter 5.

F.4 Running Cluster Management GUI
To run the Cluster Management GUI on the cluster from a workstation running X11:

1. From a Linux desktop PC, log in to the cluster with SSH X-forwarding:
ssh -Y root@mycluster
2. Start the Cluster Management GUI:
cmgui
3. Click on the connect button (see figure 3.3) and enter the password that was configured during installation.

284 Quickstart Installation Guide

4. (Optional) For more information on how the Cluster Management GUI can be used to manage one or more clusters, consult section 3.4.

To run the Cluster Management GUI on a desktop PC:

1. Copy the appropriate package(s) from /cm/shared/apps/cmgui/dist to the desktop PC:
scp root@mycluster:/cm/shared/apps/cmgui/dist/* /tmp
Note: On Windows use e.g. WinSCP.
2. Copy the PFX certificate file from the cluster that will be used for authentication purposes:
scp root@mycluster:admin.pfx ~/mycluster-admin.pfx
3. Install the package. On Windows: execute the installer and follow the steps. On Linux: extract using tar -xvjf filename.
4. Start the cluster management GUI. On Windows: from the Start menu or by clicking the desktop icon. On Linux: change into the cmgui directory and execute ./cmgui.
5. Click on Add a
ntryCSN eq
index entryUUID eq

overlay syncprov
syncprov-checkpoint <ops> <minutes>
syncprov-sessionlog <size>

The openldap documentation (http://www.openldap.org/doc) has more on the meanings of these directives. If the values for <ops>, <minutes> and <size> are not already set, typical values are:
syncprov-checkpoint 1000 60
and:
syncprov-sessionlog 100

To allow the consumer to read the provider database, the consumer's access rights need to be configured. In particular, the userPassword attribute must be accessible. LDAP servers are often configured to prevent unauthorized users reading the userPassword attribute.
Read access to all attributes is available to users with replication privileges. So one way to allow the consumer to read the provider database is to bind it to replication requests.
Sometimes a user for replication requests already exists on the provider, or the root account is used for consumer access. If not, a user for replication access must be configured.
A replication user, syncuser, with password secret, can be added to the provider LDAP with adequate rights using the following syncuser.ldif file:

dn: cn=syncuser,<suffix>
objectClass: person
cn: syncuser
sn: syncuser
userPassword: secret

Here, <suffix> is the suffix set in slapd.conf, which is originally something like dc=example,dc=com. The syncuser is added using:

ldapadd -x -D "cn=root,<su
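One way to grant syncuser the read access described above is an ACL in the provider's slapd.conf. The exact rules depend on the site's existing access configuration, so the fragment below is only a typical sketch, with <suffix> a placeholder as before:

```
# Let the replication user read everything, including userPassword.
access to attrs=userPassword
        by dn.exact="cn=syncuser,<suffix>" read
        by self write
        by anonymous auth
        by * none
access to *
        by dn.exact="cn=syncuser,<suffix>" read
        by * read
```

Note that slapd evaluates access directives in order and stops at the first matching one, so the userPassword rule must come before the catch-all rule.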
o expire in 30 days, to be run with the privileges of user peter, can be created with:

Example

createcertificate 1024 democert a b c d ef readonly peter 30 /home/peter/peterfile.key /home/peter/peterfile.pem

Thu Apr 14 15:10:53 2011 [notice] bright51: New certificate request with ID: 1
[bright51->cert]% createcertificate 1024 democert a b c d ef readonly peter 30 /home/peter/peterfile.key /home/peter/peterfile.pem
Certificate key written to file: /home/peter/peterfile.key
Certificate pem written to file: /home/peter/peterfile.pem

Users given this certificate can then carry out cmdaemon tasks that have a read-only profile, and as user peter.

7.5.2 Creating A New Certificate For cmgui Users
In a similar way to how cmsh creates a certificate and key files in the preceding section, cmgui users can create a certificate and a .pfx file. This is done via the Authentication resource of cmgui, using the Certificates tab (Figure 7.2; the tab lists certificates with their Name, Serial, Days left, Profile and Country columns)
o set up the functional equivalent of the behavior of the basic example of section 10.1. In this section the task will be continued and completed, and on the way, how to use the health checks configuration object methods to do this will be shown.
First, the script is added, and as usual when using add, the prompt drops into the level of the added object. The show command acting on the object displays the following default values for its parameters (some prompt text elided for display purposes):

Example

[MasterNode->...->healthconf]% add cpucheck
[MasterNode->...->healthconf*[cpucheck*]]% show
Parameter              Value
---------------------- ------------------------
Check Interval         120
Disabled               no
Fail Actions
Fail severity          10
GapThreshold           2
HealthCheck            cpucheck
HealthCheckParam
LogLength              3000
Only when idle         no

206 Cluster Monitoring

Pass Actions
Stateflapping Actions
Store                  yes
ThresholdDuration      1
Unknown Actions
Unknown severity       10
[MasterNode->...->healthconf*[cpucheck*]]%

The details of what these parameters mean is covered in section 10.4.3, where the edit and add dialog options for a health check state shown in Figure 10.23 are explained.
The object manipulation commands introduced in section 3.6.3 will work as expected at the healthconf prompt level in the example above: add and remove will add and remove a health check; set, get and clear will set and get values for the parameters of each health check; refresh and commit will
306. o that section which can take a while This document is therefore a quickstart document explaining how to change the IPv4 network settings while assuming no prior knowledge of Bright Cluster Manager and its network configuration interface J 2 Method A cluster consists of a head node and one or more regular nodes The head node of the cluster is assumed to face the internal network the net work of regular nodes on one interface say eth0 The external network leading to the internet is then on another interface say eth1 This is re ferred to as a type 1 configuration in the manual Typically an administrator gives the head node a static external IP address before actually connecting it up to the external network This requires logging into the physical head node with the vendor supplied root password The original network parameters of the head node can then be viewed and set For example for eth1 cmsh c device interfaces master get ethi ip 0 0 0 0 Here 0 0 0 0 means the interface accepts DHCP server supplied values Setting a static IP address value of for example 192 168 1 176 and checking the value once more cmsh c device interfaces master set ethi ip 192 168 1 176 commit cmsh c device interfaces master get ethi ip 192 168 1 176 Other external network parameters can be viewed and set in a similar way as shown in table J 1 Bright Computing Inc Changing The Network Parameters Of The Hea
o the past, the raw log data point is removed, is given by:

    t(raw-gone) = LogLength(metric) × SamplingInterval(metric)

This value is also the default consolidation time, because the consolidated data values are normally presented from t(raw-gone) seconds ago to further into the past. The default consolidation time occurs when the Time Offset has its default, zero value.

10.4 Monitoring Configuration With Cmgui 183

If however the Time Offset period is non-zero, then the consolidation time is offset, because the time into the past from which consolidation is presented to the user, t(consolidation), is then given by:

    t(consolidation) = t(raw-gone) + Time Offset

The monitoring visualization graphs then show consolidated data from t(consolidation) seconds into the past, to further into the past.

• Kind: the kind of consolidation done on the raw data samples. The output result for a processed set of raw data (the consolidated data point) is an average, a maximum or a minimum of the input raw data values. Kind can thus have the value Average, Maximum or Minimum.

10.4.3 Health Check Configuration
The Health Check Configuration tab behaves in a similar way to the Metric Configuration tab of section 10.4.2, with some differences arising due to working with health checks instead of metric values.
The Health Check Configuration tab allows device categories to be selected for evaluating the states of health
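As a worked example of the two formulas above, the shell arithmetic below (illustrative only, not Bright Cluster Manager code) uses a LogLength of 3000 samples, a sampling interval of 120 seconds, and an example Time Offset of one hour:

```shell
# Sketch: how long a raw data point survives, and from how far in the
# past consolidated data is presented. Values are illustrative.
loglength=3000       # samples kept per metric (LogLength)
interval=120         # sampling interval in seconds
time_offset=3600     # example non-zero Time Offset (one hour)
t_raw_gone=$(( loglength * interval ))
t_consolidation=$(( t_raw_gone + time_offset ))
echo "raw data point removed after:    $t_raw_gone s"       # 360000
echo "consolidation presented from:    $t_consolidation s"  # 363600
```

So with these values, raw samples survive for 360000 seconds (just over four days), and consolidated data is shown from 363600 seconds ago further into the past.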
308. objects to foreach directly the flags may be used to select the nodes to loop over The g and c flags take a node group and category argument respectively The n flag takes a node list argument Node lists may be specified using the following syntax lt node 0de lt node lt node Example demo gt device foreach c slave status nodeOO1 DOWN node002 04 DOWN demo gt device foreach g rack8 status demo gt device foreach n node001 node008 node016 node032 node080 status demo gt device Finally the wildcard character with foreach implies all the objects that the list command lists for that mode It is used without flags Example myheadnode gt device foreach get ip status 10 141 253 1 switch 2 2 sscds00 DOW 10 141 255 254 myheadnode C UU j 10 141 0 1 nodeOQO1 0 CLOSED 10 141 0 2 nodeOQ02 0 CLOSED myheadnode gt device Bright Computing Inc 50 Cluster Management with Bright Cluster Manager 3 7 Cluster Management Daemon The cluster management daemon or CMDaemon is a server process that runs on all nodes of the cluster including the head node The cluster man agement daemons work together to make the cluster manageable When applications such as cmsh and cmgui communicate with the cluster man agement infrastructure they are actually interacting with the cluster man agement
309. onds The show command will show the parameters and values of a specific consolidator Example metricconf CPUUser gt thresholds exit metricconf CPUUser consolidators metricconf CPUUser gt consolidators list Name key Length Interval Day 1000 86400 Hour 2000 3600 Week 1000 604800 metricconf CPUUser gt consolidators show day Parameter Value Interval 86400 Kind AVERAGE Length 1000 Name Day Offset 0 The meanings of the parameters are explained in the GUI equivalent of the above example in section 10 4 2 in the section labeled Metric Con figuration Consolidators Options The object manipulation commands introduced in section 3 6 3 will work as expected at this cmsh prompt level add and remove will add and remove a consolidator set get and clear will set and get values for the parameters of each consolidator refresh and commit will revert and commit changes use will use the specified consolidator making it the default for commands and validate applied to the consolidator will check if the consolidator object has sensible values cmsh monitoring setup healthconf The healthconf submode is the alternative to the metricconf submode under the main monitoring setup mode Like the metricconf option healthconf too can only be used with a device category specified If the session above is continued and the device category masternode is kept unchanged then the healthconf submode can
310. one but not yet committed Similarly a entry indicates an object that is to be removed on committing while a blank entry indicates that the object has been modified without an addition or removal involved Cloning an object is a convenient method of duplicating a fully con figured object When duplicating a device object cmsh will attempt to automatically assign a new IP address using a number of heuristics In the above example node101 is assigned IP address 10 141 0 101 Working With Objects get set refresh The get command is used to retrieve a specified property from an object and set is used to set it Example mycluster gt device use node101 mycluster gt device node101 get category test slave mycluster gt device node101 set category slave mycluster gt device node101 get category slave mycluster gt device node101 modified State Type Name Device node101 mycluster gt device node101 refresh mycluster gt device node101 modified No modified objects of type device mycluster gt device node101 get category test slave mycluster gt device node101 Here the category property of the node101 object is retrieved by us ing the get command The property is then changed using the set com mand Using get confirms that the value of the property has changed and the modified command reconfirms that node101 has local uncom mitted changes The refresh command undoes the changes
311. oning called a port mismatch This type of port mismatch situation occurs typically during a mistaken node swap when two nodes are taken out of the cluster and returned but their positions are swapped by mistake or equivalently they are returned to the correct place in the cluster but the switch ports they connect to are swapped by mistake To prevent configuration mistakes the node installer dis plays a port mismatch dialog Figure 6 10 allowing the user to retry accept a node configuration that is associated with the detected Eth ernet port or to manually select another node configuration via a sub dialog Figure 6 7 By default in the main port mismatch dia log port detection is retried after a timeout Figure 6 10 Scenarios Port Mismatch Dialog 6 The node is known and an Ethernet switch port is not detected However the configuration associated with the node s MAC ad dress does have an Ethernet port associated with it This is also considered a port mismatch To prevent configuration mistakes the node installer displays a port mismatch dialog similar to Fig ure 6 10 allowing the user to retry or to drop into a sub dialog and manually select a node configuration By default in the port mis match dialog port detection is retried after a timeout The node is known and an Ethernet switch port is detected How ever the configuration associated with the node s MAC address has no Ethernet sw
312. oning nodes thereby dis tributing network traffic loads when many nodes are booting Creating provisioning nodes is done by assigning the provisioning role to a node or category of nodes Bright Computing Inc 84 Node Provisioning 6 1 1 Provisioning Nodes Configuration Settings The provisioning role has several parameters that can be set Property Description allImages When set to yes the provisioning node provides all available images re gardless of any other parameters set By default it is set to no images A list of images provided by the pro visioning node These are used only if allImages is no maxProvisioningNodes The maximum number of nodes that can be provisioned in parallel by the provisioning node The optimum number depends on the infrastructure The default value is 10 which is safe for typical cluster setups Setting it lower may sometimes be needed to prevent network and disk overload nodegroups A list of node groups If set the pro visioning node only provisions mem bers of the listed groups By default this value is unset and the provisioning node supplies any node Typically this is used to set up a convenient hierar chy of provisioning for example based on grouping by rack and by groups of racks A provisioning node keeps a copy of all the images it provisions on its local drive in the same directory as where the head node keeps such im ages T
Figure 10.4: cmgui Conceptual Overview: Monitoring Types

10.2.11 Conceptual Overview: cmgui's Main Monitoring Interfaces
Monitoring information is presented in several places in cmgui, for convenience during everyday use. The conceptual overview here covers the layout. There are 4 monitoring-related viewing areas for the user in cmgui (Figure 10.4):

1. Visualization: visualization of monitoring data is made available from cmgui's monitoring menu, and launches a new window. Graphs are generated from metrics and health checks data, and these graphs are viewed in various ways within window panes. The use of the visuali
ot expired. If the attributes of the license are correct, the remaining parts of this section (4.1.3) may safely be skipped.

Requesting A License
If the license has expired, or if the license attributes are otherwise not correct, a new license file must be requested. Although the most convenient way to obtain such a license is with a cluster that is able to access the internet, the request can also be made regardless of cluster connectivity to outside networks, as will be elaborated upon shortly.

The request for a new license file is made using the request-license command, together with a product key. The product key entitles the user to request a license, and is a sequence of digits similar to the following:

000354-515786-112224-207441-186713

A product key is obtained from any Bright Cluster Manager reseller, and is activated by the user when obtaining the license. A product key can obtain a license only once. Upon product key activation, the license obtained permits the cluster to work with particular settings for, amongst others, the period of use and the number of nodes.

There are four options to use the product key to get the license:

1. If the cluster has access to the WWW port, the product key is activated immediately on successfully completing the dialog started by the request-license command.

• If the cluster uses a web proxy, then the environment variable http_
ot of diskspace. Use rm -rf /tmp/cuda32 to remove the data.

Another method to verify that CUDA is working is to build and use the deviceQuery command on a node containing one or more GPUs. The deviceQuery command lists all CUDA-capable GPUs in a device, along with several of their properties.

Example

[root@cuda-test ~]# module load cuda32/toolkit
[root@cuda-test ~]# cd $CUDA_SDK/C
[root@cuda-test C]# make clean
[root@cuda-test C]# make
[root@cuda-test C]# bin/linux/release/deviceQuery
bin/linux/release/deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

There are 2 devices supporting CUDA

Device 0: "Tesla T10 Processor"
  CUDA Driver Version:                           3.20
  CUDA Runtime Version:                          3.20
  CUDA Capability Major/Minor version number:    1.3
  Total amount of global memory:                 4294770688 bytes
  Multiprocessors x Cores/MP = Cores:            30 (MP) x 8 (Cores/MP) = 240 (Cores)
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       16384 bytes
  Total number of registers available per block: 16384
  Warp size:                                     32
  Maximum number of threads per block:           512
  Maximum sizes of each dimension of a block:    512 x 512 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             256 bytes
  Clock rate:                                    1.30 GHz
  Concurrent copy and execution:                 Yes
  Run time limit on kernels:                     No
  Integrated:                                    No
ould be placed on DRBD filesystems.

Enter the hostnames of the primary and secondary head nodes, and the physical disk partitions to use on both head nodes.

Confirm that the contents of the listed partitions can be erased on both head nodes.

After DRBD-based filesystems have been created, the current contents of the shared directories will be copied onto the DRBD-based filesystems, and the DRBD-based filesystems will be mounted over the old non-shared filesystems. Once the setup process has completed, select DRBD Status/Overview to verify the status of the DRBD block devices.

13.2.4 Automated Failover
If automatic failover is desired, the two head nodes must be able to power off their counterpart. This is done by setting up power control (see chapter 5 for details). The device power status command in cmsh can be used to verify that power control is functional.

Example

[master1]% device power status -n mycluster1,mycluster2
apc03:21 ................ [   ON   ] mycluster1
apc04:18 ................ [   ON   ] mycluster2

If IPMI is used for power control, it is possible that a head node is not able to reach its own IPMI interface over the network. This is especially true when no dedicated IPMI network port is used. In this case, device power status will report a failure for the active head node. This does not necessarily mean that the head nodes cannot reach the IPMI interface of their counterpart.
over sequence involves taking over resources, services and network addresses from the active head node. The goal is to continue providing services to compute nodes, to allow jobs running on these nodes to keep running.

13.1.1 Services
There are several services being offered by a head node to the cluster and its users. One of the key aspects of the HA implementation in Bright Cluster Manager is that, whenever possible, services are offered on both the active as well as the passive head node. This allows the capacity of both machines to be used for certain tasks (e.g. provisioning slave nodes), but it also means that there are fewer services to move in the event of a failover sequence.

On a default HA setup, the following services (key for cluster operations) are always running on both head nodes:

• CMDaemon: providing certain functionality on both head nodes (e.g. provisioning)
• DHCP: load balanced setup
• LDAP: running in replication mode
• MySQL: running in multi-master replication mode
• NTP
• DNS

When an HA setup is created, the above services are automatically reconfigured for an HA environment with two head nodes. In addition, both head nodes will also receive the Provisioning role, which means that slave nodes can be provisioned from both head nodes. The implications of running a cluster with multiple provisioning nodes are described in section 6.1. Most importantly
ower case with the spaces removed. However, they are space- and case-insensitive, so typing in show "Power off", with the quotes included to pass the space on, is also valid.

The get command returns the value of an individual parameter of the action object:

Example

[myheadnode->monitoring->actions]% get poweroff runon
master
[myheadnode->monitoring->actions]%

cmsh monitoring actions: add, use, remove, commit, refresh, modified, set, clear and validate
In the basic example of section 10.1, in "Adding The Action To The Actions List", the name, description and command for an action were added via a dialog in the Actions tab of cmgui. The equivalent is done in cmsh with the add and set commands.

The add command adds an object, makes it the current object, and sets its name at the same time, while the set command sets values. If there is no killallyes action already, then the name is added in actions mode with the add command, as follows:

Example

[myheadnode->monitoring->actions]% add killallyes
[myheadnode->monitoring->actions*[killallyes*]]%

The converse to the add command is the remove command, which removes the action.

The use command is the usual way of using an object, where "using" means that the object being used is referred to by default by any command run. So if the killallyes object already exists, then use killallyes w
p. The last stage of creating an HA setup involves setting up a shared storage solution.

NAS
1. In the cmha-setup main menu, select the Setup Shared Storage option.
2. Select NAS.
3. Select the parts of the filesystem that should be copied to NAS filesystems.
4. Configure the NFS server and the paths to the NFS volume for each of the chosen mountpoints.
5. If the configured NFS filesystems can be correctly mounted from the NAS server, the process of copying the local filesystems onto the NAS server will begin.

DAS
1. In the cmha-setup main menu, select the Setup Shared Storage option.
2. Select DAS.
3. Select the parts of the filesystem that should be placed on shared DAS filesystems.
4. Enter the hostnames of the primary and secondary head nodes and the physical disk partitions to use on both head nodes.
5. Confirm that the contents of the listed partitions can be erased on both head nodes. After filesystems have been created, the current contents of the shared directories will be copied onto the shared filesystems, and the shared filesystems will be mounted over the old non-shared filesystems.

DRBD
1. In the cmha-setup main menu, select the Setup Shared Storage option.
2. Select DRBD.
3. Select Install DRBD to install the drbd RPMs if they have not been installed yet.
4. Select DRBD Setup.
5. Select the parts of the filesystem that sh
pm --root /cm/images/lustre-client-image -q kernel
[root@mycluster ~]# yum install --installroot=/cm/images/lustre-client-image kernel-devel

The Lustre software is then built and installed:

Example

[root@mycluster ~]# chroot /cm/images/lustre-client-image
[root@mycluster /]# cd /usr/src
[root@mycluster src]# ln -s kernels/$(uname -r)-x86_64 linux
[root@mycluster src]# tar zxvf lustre-2.0.0.1.tar.gz
[root@mycluster src]# cd lustre-2.0.0.1
[root@mycluster lustre-2.0.0.1]# ./configure --disable-server
[root@mycluster lustre-2.0.0.1]# make
[root@mycluster lustre-2.0.0.1]# make install
[root@mycluster lustre-2.0.0.1]# depmod -a
[root@mycluster lustre-2.0.0.1]# cd /usr/src
[root@mycluster src]# rm -rf lustre-2.0.0.1 linux
[root@mycluster src]# exit

To configure the lnet kernel module to use TCP/IP, the string options lnet networks=tcp is added to the /etc/modprobe.conf file of the client image:

[root@mycluster ~]# echo "options lnet networks=tcp" >> /cm/images/lust\
re-client-image/etc/modprobe.conf

Creating The Lustre Client Category
A node category is cloned, for example: slave to lustre-client. The software image is set to the Lustre client image.

Example

[root@mycluster ~]# cmsh
[mycluster]% category
[mycluster->category]% clone slave lustre-client
[mycluster->category*[lustre-client*]]% set softwa
proxy must be set before request-license is run. From a bash prompt this is set with:

export http_proxy=<proxy>

where <proxy> is the hostname or IP address of the proxy.

2. If the cluster does not have access to the WWW port, the administrator may activate the product key by pointing an off-cluster web browser to http://support.brightcomputing.com/licensing. The CSR (Certificate Sign Request) data generated by running the request-license command on the cluster is entered in the web form at that URL, and a signed license will be returned. This license is in the form of a plain text certificate. As the web form response explains, it is to be saved to the head node as a file, and saving it directly is possible from most browsers. Cutting and pasting it into an editor and saving it on the head node as a file will do the job too, since it is plain text. The license certificate is then installed by running the command install-license <filename> on the head node.

3. If no web access is available to the administrator, the CSR data that was generated by the request-license command may be sent by email to ca@brightcomputing.com. A certificate will be emailed back from the Bright Cluster Manager License Desk. This certificate can then be handled further as described in option 2.

4. If no internet access is available at all to the administrator, the CSR data may be faxed or sent by postal mail to any Bright Cluster Man
[myheadnode->monitoring->setup[MasterNode]->metricconf[CPUUser]]% show
Parameter              Value
---------------------  ------------------
Consolidators          <3 in submode>
Disabled               no
GapThreshold           2
LogLength              3000
Metric                 CPUUser
MetricParam
Only when idle         no
Sampling Interval      120
Stateflapping Actions
Store                  yes
ThresholdDuration      1
Thresholds             <1 in submode>
[myheadnode->monitoring->setup[MasterNode]->metricconf[CPUUser]]%

The add command adds a metric to be set for sampling for the device category. The list of all possible metrics that can be added to the device category can be seen with the command monitoring metrics list, or, more conveniently, simply with tab-completion suggestions to the add command at the metricconf prompt in the above example.

The above example indicates that there are two submodes for each metric configuration: Consolidators and Thresholds. Running the consolidators or thresholds commands brings cmsh into the chosen submode.

Consolidation and threshold manipulation only make sense in the context of a metric configuration, so at the metricconf prompt in the example above, before use cpuuser is executed, the commands thresholds cpuuser or consolidators cpuuser can be executed as more direct ways of getting to the chosen submode.

thresholds: If, continuing on from the above example, the thresholds submode is entered, then the list command will list the existing thresholds. If the basic example of section 10.1 has already been
r->user[printer]]% show
Parameter        Value
---------------  ------------------
Group ID         503
Group members    maureen
Group name       printer

The clear command can also be used to clear members, but will clear all of the extras from the group:

Example

[mycluster->user[printer]]% clear groupmembers
[mycluster->user[printer]]% show
Parameter        Value
---------------  ------------------
Group ID         503
Group members
Group name       printer

The commit command is intentionally left out at this point in the session, in order to illustrate how reversion is used in the next section.

7.2.4 Reverting To The Unmodified State
This corresponds roughly to the functionality of the Revert button operation in section 7.1. This section (7.2.4) continues on from the state of the session at the end of section 7.2.3. There, the state of group printer was changed so that the extra added members were removed. This state (the state with no group members showing) was, however, not yet committed.

The refresh command will revert an uncommitted object back to the last committed state. This happens at the level of the object it is using. For example, the object that is being handled here is the properties of the group printer. Running refresh at a higher level prompt (say, at user mode level) would revert everything at that level and below. So, in order to affect only the properties of the group printer, the refresh command is used at the group printer level prompt. It will then revert
r Settings tab. For the APC brand of PDUs, the Power controlled by property in the Settings tab should be set to apc, or the list of PDU ports will be ignored by default. Overriding the default is described in section 5.1.3.

(Figure 5.1 shows a head node's Settings tab, with, among others, the Hostname, Hardware tag, MAC address, Rack position, Ethernet switch, Power controlled by, Custom power script and Power Distribution Unit port settings.)

Figure 5.1: Head Node Settings

Since nodes may have multiple power feeds, there may be multiple PDU ports defined for a single device. The cluster management infras
325. r example it may be useful to have a printer group so that several users can share access to a printer For the sake of this example continu ing our session from where it was left off above tim and fred are now added to the LDAP directory along with a printer group Bright Computing Inc 7 2 Managing Users And Groups With cmsh 123 Example mycluster gt user maureen add user tim add user fred mycluster gt user fred add group printer mycluster gt user printer commit mycluster gt user printer Note the context switch that happened here in the cmsh user mode en vironment the context of user maureen was eventually replaced by the context of group printer As a result the group printer is committed but the users tim and fred are not yet committed which is indicated by the asterisk at the user mode level Continuing onwards to add users to a group the append command is used A list of users maureen tim and fred can be added to the printer group like this Example mycluster gt user printer append groupmembers maureen tim fred commit mycluster gt user printer show Parameter Value Group ID 503 Group members maureen tim fred Group name printer To remove users from a group the removefrom command is used A list of specific users for example tim and fred can be removed from a group like this mycluster gt user printer removefrom groupmembers tim fred commit mycluste
326. r metric Example 4 add responsiveness fresponsiveness set command cm local apps cmd scripts metrics sample_responsiveness 4 set classofmetric prototype commit 1 2 Metric Collections Initialization When a metric collections script is added to the framework for the first time it is implicitly run with the initialize flag which detects and adds component metrics to the framework The displayed output of a metric collections script when using the initialize flag is a list of available metrics and their parameter values The format of each line in the list is metric lt name gt lt unit gt lt class gt lt description gt lt cumulative gt lt min gt lt max gt where the parameters are metric A bare word name The name of the metric unit A measurement unit class Any of misc cpu disk memory network environmental operatingsystem internal workload cluster description This can contain spaces but should be enclosed with quotes Bright Computing Inc 302 Metric Collections cumulative Either yes or no default is no This indicates whether the metric increases monotonically e g bytes received or not e g temperature min and max The minimum and maximum numeric values of this met ric which still make sense Example root myheadnode metrics sample_responsiveness initialize metric util_sda internal Percentage of CPU time during which I 0 requests wer
327. r start Compiling CUDA3 2 driver installing probe OK root mycluster dmesg PCI Setting latency timer of device 0000 07 00 0 to 64 PCI Setting latency timer of device 0000 09 00 0 to 64 NVRM loading NVIDIA UNIX x86_64 Kernel Module 260 19 21 Thu Nov 4 16 27 PDT 2010 12 5 2 Verifying CUDA An extensive method to verify that CUDA is working is to run the verify_cuda32 sh script located in the CUDA SDK directory This script first copies the CUDA SDK source to a local directory un der tmp It then builds CUDA test binaries and runs them It is possible to select which of the CUDA test binaries are run A help text showing available script options is displayed when verify_cuda32 sh h is run Example root cuda test module load cuda32 toolkit root cuda test cd CUDA_SDK root cuda test 3 2 verify_cuda32 sh Copy cuda32 sdk files to tmp cuda32 directory make clean make can take a while Bright Computing Inc 21 224 Third Party Software Run all tests y N y Executing tmp cuda32 C bin linux release alignedTypes alignedTypes CUDA device Tesla T10 Processor has 30 Multi Processors SM scaling value 1 00 gt Memory Size 49999872 Allocating memory All cuda32 just compiled test programs can be found in the tmp cuda32 C bin linux release directory They can be executed from the tmp cuda32 C directory The tmp cuda32 directory can take up a l
328. rection is then needed to proceed further If all settings are valid the installation proceeds on to the Nameservers screen described in the next section Bright Cluster Manager Installer Networks a Management network internainet externalnet ibnet ipminet j 7 Networks Network parameters externainet Name externalnet Base address Netmask Domain name Default gateway Network type Ethernet oe oo ce Figure 2 11 Networks Configuration Nameservers and search domains Nameservers and search domains can be added or removed using the Nameservers screen Figure 2 12 Clicking on Continue leads to the Network Interfaces configuration screen described next Bright Computing Inc 2 3 Head Node Installation 15 Bright Cluster Manager Installer Nameservers m Nameservers Add nameserver Search Domains Nameservers Add search domain i Figure 2 12 Nameservers and search domains Network Interfaces Configuration The Network Interfaces screens Figures 2 13 and 2 14 show the list of network interfaces that have been predefined for type 1 and type 3 setups respectively Each screen has a network configuration section for the head node and for the regular nodes For node network interfaces the IP offset can be modified The offset is used to calculate the IP address assigned to the interface on the se lected network For example a different offset might be desirable
329. reimage lustre client image mycluster gt category lustre client commit The Lustre client category is configured to mount the Lustre filesystem some text in the display here is elided Example root mycluster cmsh mycluster category mycluster gt category use lustre client mycluster gt category lustre client fsmounts mycl fsmounts add mt lustre00 myc fsmounts mnt lustre00 set device 10 141 16 1 tcp0 lustre00 myc fsmounts mnt lustre00 set filesystem lustre myc fsmounts mnt lustre00 set mountoptions rw _netdev myc fsmounts mnt lustre00 commit The configured fsmounts device is the MGS which in the example has IP address 10 141 16 1 The network type used in the example is TCP IP Creating Lustre Client Nodes A client node is created as follows Example root mycluster cmsh mycluster device mycluster gt device add slavenode lclient001 10 141 48 1 mycluster gt device lclient001 set category lustre client mycluster gt device 1client001 commit The Lustre client is booted and checked to see if the Lustre filesystem is mounted The Lustre file stripe configuration of the filesystem can be checked with lfs getstripe The Lustre file striping can be set with 1fs setstripe Example root lclient001 lfs getstripe mnt lustre00 root lclient001 lfs setstripe s 4M o 1 c 1 mt lustre00 The 1fs setstripe command in the e
330. rical comparison of a metric value with a threshold value A health check on the other hand has more general checking capa bilities Bright Computing Inc 10 2 Monitoring Concepts And Definitions 169 With some inventiveness a health check can be made to do the func tion of a metric s threshold action sequence as well as the other way round The considerations above should help decide what the appropriate tool health check or metric threshold check should be for the job 10 2 6 Severity Severity is a positive integer value that the administrator assigns to a threshold crossing event or to a health check status event By default it is 10 It is used in the AlertLevel metric section 10 2 7 10 2 7 AlertLevel AlertLevel is a special metric It is not sampled but it is re calculated when an event with an associated Severity section 10 2 6 occurs There are two types of AlertLevel metrics 1 AlertLevel max which is simply the maximum severity of the latest value of all the events The aim of this metric is to alert the admin istrator to the severity of the most important issue 2 AlertLevel sum which is the sum of the latest severity values of all the events The aim of this metric is to alert the administrator to the overall severity of issues 10 2 8 InfoMessages InfoMessages are messages that inform the administrator of the reason for a health status event change in the cluster These show up in the O
rk): /cm/local/apps/cmd/scripts/healthchecks/portchecker; the other health checks visible in the figure include ssh2node (Network), ldap (Operating System), mysql (Operating System), failedprejob (Workload) and rogueprocess (Workload), each implemented by a script of the same name under /cm/local/apps/cmd/scripts/healthchecks/.)

Figure 10.27: cmgui Monitoring: Main Health Checks Tab

or adding metrics, with a few exceptions. The exceptions are for options that are inapplicable for health checks, and are elaborated on in appendix H.2.2.

10.4.6 Actions
The Actions tab lists available actions (Figure 10.28) that can be set to run on the system from metrics thresholds configuration, as explained in section 10.4.2, and as was done in the basic example of section 10.1. Actions can also be set to run from health check configuration action launcher options, as described in section 10.4.3.

(The actions visible in Figure 10.28 include killallyes, "kill all yes processes", implemented by /cm/local/apps/cmd/scripts/actions/killallyes, and killprocess, an action which kills processes of pids
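An action is just an executable that the framework runs when triggered. A hypothetical minimal action, sketched below, simply records that it fired; the log location and message are invented for this illustration, not part of the shipped action scripts:

```shell
# Hypothetical action script sketch (illustrative only): append a
# timestamped line to a log file whenever the action fires.
# The log path is an assumption, overridable via the environment.
ACTION_LOG="${ACTION_LOG:-/tmp/cmd-action-demo.log}"

log_action() {
    echo "$(date '+%Y-%m-%d %H:%M:%S') action fired on $(hostname)" >> "$ACTION_LOG"
}

log_action
```

A script like this, made executable and configured as an action's command, would be launched by the monitoring framework in the same way as the shipped killallyes script.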
Head Nodes folder, selecting the head node, and selecting the Roles tab to display the possible roles. After a workload manager server role is chosen and saved (Figure 8.1), the workload manager process automatically starts up.

(Figure 8.1 shows the Roles tab of a head node, with, among others, the SGE Client Role, SGE Server Role, Torque Client Role, Torque Server Role, PBSPro Client Role and PBSPro Server Role entries.)

Figure 8.1: Workload Management Role Assignment On A Head Node

Similarly, the workload manager client process can be enabled on a node by having the workload manager client role assigned and saved for that node. The client process then automatically starts up.

While role assignment can be done as described for individual nodes, it is usually more efficient to do role assignment using categories, due to the large number of compute nodes in typical clusters. All non-head nodes are by default placed
rmed.

11.5.1 BIOS Configuration
In order to configure the BIOS on a group of nodes, an administrator needs to manually configure the BIOS on a reference node, using the conventional method of entering BIOS Setup mode at system boot time. After the BIOS has been configured, the machine needs to be booted as a node. The administrator may subsequently use the cmospull utility on the node to create a snapshot of the reference node's NVRAM contents.

Example

ssh node001 /cm/shared/apps/cmbios/nodebios/cmospull > node001.nvram

After the NVRAM settings of the reference node have been saved to a file, the settings need to be copied to the generic DOS image, so that they can be written to the NVRAM of the other nodes. The generic DOS image is located in /cm/shared/apps/cmbios/nodebios/win98boot.img. It is generally a good idea to copy the generic image and make changes to the copy only:

Example

cp -a win98boot.img flash.img

To modify the image, it is first mounted:

mount -o loop flash.img /mnt

When the DOS image has been mounted, the utility that writes out the NVRAM data needs to be combined with the NVRAM data into a single DOS executable. This is done by appending the NVRAM data to the cmosprog.bin file. The result is a DOS .COM executable:

Example

cat cmosprog.bin node001.nvram > cmosprog.com

The generated .COM is then copied to the image and should be
rning FAIL if it is not

• checking if CPUUser is below 50, and returning PASS if it is

• checking if the cmsh binary is found, and returning UNKNOWN if it is not

A health check has a settable severity (section 10.2.6) associated with a FAIL or UNKNOWN response. This value is processed for the AlertLevel metric (see section 10.2.7) when the health check runs.

A health check can also launch an action based on any of the response values, similar to the way that an action is launched by a metric with a threshold condition.

10.2.5 Conceptual Overview: Health Checks Vs Threshold Checks
A health check is quite similar to a threshold state check with a metric. Conceptually, however, they are intended to differ, as follows:

• A threshold state check works with numeric values. A health check, on the other hand, works with a response state of PASS, FAIL or UNKNOWN.

• Threshold checking does not store a history of whether the threshold condition was met or not; it just calls the action script right away as its response. Admittedly, the associated metric data values are still kept by the monitoring framework, so establishing if a threshold has been crossed historically is always possible with a little effort. A health check, on the other hand, stores its PASS/FAIL/UNKNOWN responses for the monitoring framework, making them easily accessible for viewing by default.

• The threshold checking mechanism is intended to be limited to doing a nume
rough selection, and viewing the separate graphs of its component metrics.

10.4.5 Health Checks
The Health Checks tab lists available health checks (Figure 10.27). These can be set to run from the system by configuring them from the Health Check Configuration tab of section 10.4.3. What the listed health checks on a newly installed system do is described in appendix H.2.1.

The remove, revert and save buttons work for health checks just like they do for metrics in section 10.4.4. Also, the edit and add buttons start up dialogs to edit and add health checks. The dialog options for health checks are the same as for editing

(Figure 10.27 begins here; the visible health checks include exports (Disk, /cm/local/apps/cmd/scripts/healthchecks/exports), mounts (Disk), cmsh (Internal), DeviceIsUp (Internal, built-in), failover (Internal), ManagedServicesOk (Internal, built-in), testhealthcheck (Misc) and portchecker (Netwo
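The health checks listed are ordinary executables whose response convention, described in section 10.2.4, is to print PASS, FAIL, or UNKNOWN. A minimal sketch of such a script follows; the check name and the load-average criterion are invented for illustration, and are not one of the shipped checks:

```shell
# Hypothetical health check sketch: prints PASS, FAIL, or UNKNOWN.
# PASS if the 1-minute load average is below the given limit (Linux only).
check_load_below() {
    limit="$1"
    load=$(awk '{print $1}' /proc/loadavg 2>/dev/null) || { echo UNKNOWN; return; }
    [ -z "$load" ] && { echo UNKNOWN; return; }
    # awk handles the floating-point comparison.
    if awk -v l="$load" -v t="$limit" 'BEGIN { exit !(l < t) }'; then
        echo PASS
    else
        echo FAIL
    fi
}

check_load_below 50
```

A FAIL or UNKNOWN response from a script like this would then be processed by the framework according to the configured severity and actions.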
rried out on them are All ethernet switches and All Power Distribution Units.

With the screen displaying a list of health checks, as in Figure 10.22, the health checks in the Health Check Configuration tab can now be configured and manipulated. The buttons used to do this are: Edit, Add, Remove, Revert and Save. These Health Configuration tab buttons behave just like the corresponding Metric Configuration tab buttons of section 10.4.2, that is:

¹ For completeness: the time t_consolidation,gone (which is how many seconds into the past the consolidated data goes, and is viewable) is given by an equation analogous to that of the equation defining t_raw,gone:

    t_consolidation,gone = Log length_consolidation × Sampling interval_consolidation

(Screen contents: the Health Check Configuration tab for AllMasterNodes, listing for each health check its parameters, log length of data points, sampling interval in seconds, pass/fail actions and store flag; for example, cmsh with log length 3000 and sampling interval 1800, DeviceIsUp 3000/120, exports 3000/1800, failedprejob 3000/900 and failover 3000/1800.)
rtition>
  </device>
</diskSetup>

D.3 Example: Preventing Accidental Data Loss
The following example shows the use of the vendor and requiredSize tags. These are optional tags which can be used to prevent accidentally repartitioning the wrong drive. If a vendor or a requiredSize element is specified, it is treated as an assertion which is checked by the node-installer. If any assertion fails, no partitioning changes are made to any of the specified devices. Note that the node-installer reads a drive's vendor string from /sys/block/<drive name>/device/vendor. Specifying device assertions is recommended for machines that contain important data, as it serves as a protection against situations where drives are assigned to incorrect block devices. This could happen, for example, when the first drive in a multi-drive system is not detected (e.g. due to a hardware failure), which could cause the second drive to become known as /dev/sda.

<?xml version="1.0" encoding="ISO-8859-1"?>
<diskSetup xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
           xsi:noNamespaceSchemaLocation="schema.xsd">
  <device>
    <blockdev>/dev/sda</blockdev>
    <vendor>Hitachi</vendor>
    <requiredSize>200G</requiredSize>
    <partition id="a1">
      <size>max</size>
      <type>linux</type>
      <filesystem>ext3</filesystem>
      <mountPoint>/</mountPoint>
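The vendor assertion described above can also be checked by hand before committing a disk setup. The sketch below emulates the node-installer's comparison; the `check_vendor` helper, the mock sysfs tree, and the Hitachi example value are illustrative assumptions (on a real node the sysfs root would be /sys and the vendor string would come from the actual drive):

```shell
#!/bin/sh
# Sketch: emulate the node-installer's vendor assertion manually.
# check_vendor <sysfs-root> <drive> <expected-vendor>
# (Helper name and values are illustrative, not part of the product.)
check_vendor() {
    root=$1; drive=$2; expected=$3
    vendor_file="$root/block/$drive/device/vendor"
    [ -r "$vendor_file" ] && grep -q "$expected" "$vendor_file"
}

# Demonstrate against a mock sysfs tree (a real node would use /sys):
mock=$(mktemp -d)
mkdir -p "$mock/block/sda/device"
printf 'Hitachi\n' > "$mock/block/sda/device/vendor"

if check_vendor "$mock" sda Hitachi; then
    echo "assertion holds: sda matches the expected vendor"
fi
```

If the check fails, the safe behavior, as the manual describes, is to make no partitioning changes at all.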
s                        Number of I/O operations currently in progress
IOTime                   Number of milliseconds spent doing I/O
LoadFifteen              Load average over 15 minutes
LoadFive                 Load average over 5 minutes
LoadOne                  Load average over 1 minute
MemoryFree               Free system memory
MemoryUsed               Used system memory
MergedReads              Total number of merged reads
MergedWrites             Total number of merged writes
NetworkBytesRecv         Cluster-wide number of bytes received on all networks
...continues

Table H.1.1: List Of Metrics (continued)

Name                     Description
NetworkBytesSent         Cluster-wide number of bytes transmitted on all networks
NetworkUtilization       Network utilization estimation
NodesUp                  Number of nodes in status UP
OccupationRate           Cluster occupation rate
PDUBankLoad              Total PDU bank load
PDULoad                  Total PDU phase load
PDUUptime                PDU uptime
PacketsRecv              Number of received packets
PacketsSent              Number of packets sent
PhaseLoad                Cluster-wide phase load
ProcessCount             Total number of processes
QueuedJobs               Number of queued jobs
RackSensorHumidity       Rack sensor humidity
RackSensorTemp           Rack sensor temperature
ReadTime
Reads
RunningJobs
RunningProcesses
SMARTHDATemp
SMARTReallocSecCnt
SMARTSeekErrRate
SMARTSeekTimePerf
SMARTSoftReadErrRate
SectorsRead
SectorsWritten
SensorFanSpeed
SensorTemp
SensorVoltage
SwapFree
SwapUsed
SwitchBroadcastPackets
SwitchCPUUsage
SwitchCollisions
s added, and a threshold was set up with the monitoring framework.

With a default installation on a newly installed cluster, the measurement of CPUUser is done every 120s. The basic example configured above therefore monitors, every 120s, whether CPUUser on the head node has crossed 50%.

If CPUUser is found to have entered the zone beyond 50%, then the framework runs the killallyes script, killing all running yes processes. Assuming the system is trivially loaded apart from these yes processes, the CPUUser metric value then drops below 50%.

Note that after an Enter threshold condition has been met for a sample, the first sample immediately after that can never meet the Enter threshold condition, because an Enter threshold crossing condition requires the previous sample to be below the threshold. The second sample can only launch an action if the Enter threshold condition is met, that is, if the sample preceding it is below the threshold.

Other non-yes CPU-intensive processes running on the head node can also trigger the killallyes script. Since the script only kills yes processes, leaving any non-yes processes alone, it would in such a case run unnecessarily. This is a deficiency due to the contrived and simple nature of the basic example illustrated here, and is of no real concern.

10.2 Monitoring Concepts And Definitions
A discussion of the concepts of
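The Enter threshold crossing rule described above can be sketched as a small predicate: the action fires only when the previous sample was at or below the threshold and the current sample is beyond it. The helper name and the integer sample values are illustrative assumptions, not part of the product:

```shell
#!/bin/sh
# Sketch of the Enter threshold-crossing rule: fire only when the
# previous sample is at or below the threshold and the current
# sample is beyond it. (Names and values are illustrative.)
enter_crossed() {
    prev=$1; cur=$2; threshold=$3
    [ "$prev" -le "$threshold" ] && [ "$cur" -gt "$threshold" ]
}

enter_crossed 40 60 50 && echo "sample 60: crossing detected, action fires"
enter_crossed 60 70 50 || echo "sample 70: no action (previous sample was already beyond)"
enter_crossed 40 45 50 || echo "sample 45: no action (threshold never crossed)"
```

This is why, as the text notes, the sample immediately after a crossing can never itself trigger a second Enter action.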
s are available for Linux and for Windows XP/Vista, and a MacOS X version will be available in the future.

On a Windows desktop, cmgui is installed by running the installer and following the installation procedure. After the installation, cmgui is started through the Start menu or through the desktop shortcut.

For Linux, cmgui is installed by untarring the tar.bz2 file and compiling it using make. A number of dependency packages, as listed in the accompanying README, may first have to be installed for make to complete successfully. After a successful make, the cmgui script can be run from the cmgui directory.

Example:

user@desktop:~> tar -xjf cmgui-5.1-r2174.src.tar.bz2
user@desktop:~> cd cmgui-5.1-r2174
user@desktop:~/cmgui-5.1-r2174> make
user@desktop:~/cmgui-5.1-r2174> cd cmgui
user@desktop:~/cmgui-5.1-r2174/cmgui> ./cmgui

If the cmgui script reports unresolved symbols, then additional packages from the Linux distribution need to be installed, as listed in the accompanying README, and the compilation redone.

At least the following software libraries must be installed in order to run cmgui:
• OpenSSL library
• GTK library
• GLib library
• Boost library (at least the thread and signals components)

3.4.2 Connecting To A Cluster
As explained in section 3.3.2, a certificate and private key are required to connect to the cluster managemen
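Missing shared libraries of the kind that cause such unresolved-symbol reports can also be spotted with ldd. The sketch below is a generic diagnostic, not a tool shipped with cmgui; the `check_libs` helper name is an assumption, and /bin/ls merely stands in for the cmgui binary:

```shell
#!/bin/sh
# Sketch: report shared libraries a binary cannot resolve, a common
# cause of "unresolved symbols" errors at startup.
# (Helper name is illustrative; point it at the cmgui binary in practice.)
check_libs() {
    missing=$(ldd "$1" 2>/dev/null | grep -c 'not found')
    if [ "$missing" -eq 0 ]; then
        echo "$1: all shared libraries resolved"
    else
        ldd "$1" | grep 'not found'
        return 1
    fi
}

check_libs /bin/ls
```

A nonzero return here suggests which distribution package (OpenSSL, GTK, GLib or Boost, per the list above) still needs installing.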
s attributes of cmsh decides what images the provisioning node supplies. The Software image in the Roles tab should not be confused with the Software image selection possibility within the Settings tab, which is the image the provisioning node requests for itself.

6.1.4 Provisioning Nodes: Housekeeping
The head node does housekeeping tasks for the entire provisioning system. Provisioning is done on request for all non-head nodes on a first-come, first-served basis. Since provisioning nodes themselves also need to be provisioned, the quickest way to cold-boot an entire cluster is to boot the head node first, followed by the provisioning nodes, and finally all other non-head nodes. Following this start-up sequence ensures that all provisioning services are available when the other non-head nodes are started up.

Some aspects of provisioning housekeeping are discussed next.

Provisioning Node Selection
When a node requests provisioning, the head node allocates the task to a provisioning node. If there are several provisioning nodes that can provide the image required, then the task is allocated to the provisioning node with the lowest number of already-started provisioning tasks.

Limiting Provisioning Tasks With MaxNumberOfProvisioningThreads
Besides limiting how much simultaneous provisioning per provisioning node is allowed with maxProvisioningNodes (section 6.1.1), the head node also limits how many simultaneous
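The selection rule above (pick the candidate with the fewest already-started tasks) can be sketched in a few lines. The `pick_node` helper, the input format and the node names are all illustrative assumptions, not the actual CMDaemon implementation:

```shell
#!/bin/sh
# Sketch of the provisioning-node selection rule: among candidates
# that can serve the requested image, pick the one with the fewest
# already-started provisioning tasks.
# Input lines: "<node> <running-task-count>" (format is illustrative).
pick_node() {
    sort -k2 -n | head -n1 | cut -d' ' -f1
}

printf '%s\n' 'node001 3' 'node002 1' 'node003 2' | pick_node
```

With the sample counts shown, node002 would receive the next provisioning task.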
      <xs:field xpath="name"/>
    </xs:unique>

    <xs:unique name="assertNamesUnique">
      <xs:selector xpath="assert"/>
      <xs:field xpath="name"/>
    </xs:unique>
  </xs:element>

  <xs:complexType name="diskless">
    <xs:attribute name="maxMemSize" type="memSize" use="required"/>
  </xs:complexType>

  <xs:simpleType name="memSize">
    <xs:restriction base="xs:string">
      <xs:pattern value="([0-9]+[MG])|(100|[0-9][0-9]|[0-9])%"/>
    </xs:restriction>
  </xs:simpleType>

  <xs:simpleType name="size">
    <xs:restriction base="xs:string">
      <xs:pattern value="max|auto|[0-9]+[MGT]"/>
    </xs:restriction>
  </xs:simpleType>

  <xs:simpleType name="extentSize">
    <xs:restriction base="xs:string">
      <xs:pattern value="[0-9]+M"/>
    </xs:restriction>
  </xs:simpleType>

  <xs:complexType name="device">
    <xs:sequence>
      <xs:element name="blockdev" type="xs:string" minOccurs="1" maxOccurs="unbounded"/>
      <xs:element name="vendor" type="xs:string" minOccurs="0" maxOccurs="1"/>
      <xs:element name="requiredSize" type="size" minOccurs="0" maxOccurs="1"/>
      <xs:element name="assert" minOccurs="0" maxOccurs="unbounded">
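Size strings can be checked against the schema's size pattern outside of a full XML validator, which is handy when hand-editing a disksetup file. The regular expression below is transcribed from the size type in the schema above (`max|auto|[0-9]+[MGT]`); the `valid_size` helper is an illustrative sketch:

```shell
#!/bin/sh
# Sketch: validate size strings against the schema's "size" pattern,
# max|auto|[0-9]+[MGT], without running an XML validator.
valid_size() {
    printf '%s\n' "$1" | grep -Eq '^(max|auto|[0-9]+[MGT])$'
}

for s in max auto 200G 1024M 2T 10K; do
    if valid_size "$s"; then echo "$s: valid"; else echo "$s: invalid"; fi
done
```

Only K-suffixed sizes such as 10K fall outside the pattern; M, G and T suffixes, plus the keywords max and auto, are accepted.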
s installed in the following location:

/cm/local/apps/cmd/etc/cert.pem

and the associated private key file is in:

/cm/local/apps/cmd/etc/cert.key

[Figure 4.1: License Information: screenshot of the License tab for a "Bright 5.1 Cluster", showing the licensee, version 5.1, Advanced edition, serial number 3171, start and end dates, licenses used, MAC address and node licenses]

To verify that the attributes of the license have been assigned the correct values, the License tab of the GUI can be used to display license details (Figure 4.1). Alternatively, the licenseinfo command in cmsh main mode may be used:

Example:

[root@51-centos5 ~]# cmsh
[51-centos5]% main licenseinfo
License Information
Licensee                 C=US, ST=California, L=San Jose, O=Testing, OU=Development, CN=Bright 5.1 Cluster
Serial Number            3171
Start Time               Sun Nov 28 00:00:00 2010
End Time                 Tue Nov 2 23:5
s set to custom, using either cmgui or cmsh, and the value of custompowerscript is specified.

The value for custompowerscript is the full path to an executable custom power management script on the head node(s) of a cluster.

A custom power script is invoked with the following mandatory arguments:

myscript <operation> <device>

where <device> is the name of the device on which the power operation is done, and <operation> is one of the following:

ON
OFF
RESET
STATUS

On success a custom power script exits with exit code 0. On failure, the script exits with a non-zero exit code.

Using custompowerscriptargument
The mandatory argument values for <operation> and <device> are passed to a custom script for processing. For example, in bash the positional variables $1 and $2 are typically used for a custom power script. A custom power script can also be passed a further argument value by setting the value of custompowerscriptargument for the node via cmsh or cmgui. This further argument value would then be passed to the positional variable $3 in bash.

An example custom power script is located at /cm/local/apps/cmd/scripts/powerscripts/power.example. In it, setting $3 to a natural number delays the script, via a sleep command, by $3 seconds. An example that is conceivably more useful than a "sleep $3" command is to have a "wakeonlan $3" command instead. If the custompowerscriptargument value is s
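A minimal skeleton for a script following the calling convention above might look like the following. The `power_op` helper and its echo placeholders are illustrative assumptions; a real script would issue actual power commands (e.g. via wakeonlan or a PDU interface) instead of printing:

```shell
#!/bin/sh
# Sketch of a custom power script skeleton: $1 is the operation,
# $2 the device, and $3 an optional extra value supplied through
# custompowerscriptargument. The echo actions are placeholders.
power_op() {
    operation=$1; device=$2; extra=$3
    case "$operation" in
        ON|OFF|RESET|STATUS)
            # A real script would act on the device here.
            echo "would perform $operation on $device${extra:+ (extra: $extra)}"
            ;;
        *)
            echo "unknown operation: $operation" >&2
            return 1   # non-zero exit code signals failure to CMDaemon
            ;;
    esac
}

power_op ON node001 3
```

Exit code 0 signals success and any non-zero code signals failure, matching the convention described above.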
s that could be chosen are "All ethernet switches", if Ethernet switches are to have their metrics configured, or "All Power Distribution Units", if power distribution units are to have their metrics configured.

[Figure 10.16: cmgui Monitoring: Metric Configuration display after category selection. The screenshot shows the Metric Configuration tab for the "All Master Nodes" category, listing metrics such as BytesSent (eth0, eth1), CacheMemory, CMDMemUsed, CompletedJobs, CPUCoresAvailable, CPUIdle, CPUIrq, CPUNice, CPUSoftirq, CPUSystem, CPUWait and CtxtSwitches, with log lengths of 3000 to 6000 data points and sampling intervals of 60 to 120 seconds]

With the screen displaying a list of metrics as in Figure 10.16, the metrics in the Metric Configuration tab can now be configured and manipulated. The buttons used to do this are Edit, Add, Remove, Thresholds, Consolidators, Revert and Save.

The Save button saves as-yet-uncommitted changes made via the Add or Edit buttons.

The Revert button discards unsaved edits made via the Edit button. The reversion goes back to the last save.

The Remove button removes a selected metric from the metrics listed. The r
screen (Figure 2.23) summarizes the installation settings and parameters configured during the previous stages. If the express mode installation was chosen, then it summarizes the predefined settings and parameters. Changes to the values on this screen are made by navigating to previous screens and correcting the values there. When the summary screen displays the right values, clicking on the Start button leads to the Installation Progress screen, described next.

[Figure 2.23: Summary of Installation Settings: screenshot listing the primary external interface IP, netmask and gateway, the primary internal interface IP and netmask, nameservers, timezone and timeservers]

Installation
The Installation Progress screen (Figure 2.24) shows the progress of the installation. It is not possible to navigate back to previous screens once the installation has begun. When the installation is complete (Figure 2.25), the installation log can be viewed in detail by clicking on Install Log.

The Reboot button restarts the machine. The BIOS boot order may need changing, or the DVD should be removed, in order to boot from the hard drive on which Bright Cluster Manager has been installed.

[Figure 2.24: Installation Progress]
use longq

[bright51->jobqueue(torque)->longq]% show
Parameter            Value
-------------------- ------------------------------------
Maximal runtime      23:59:59
Minimal runtime      00:00:00
Queue type           Execution
Routes
Type                 torque
name                 longq
nodes                node001.cm.cluster node002.cm.cluster
[bright51->jobqueue(torque)->longq]% get maximalruntime
23:59:59

8.6.3 Nodes Drainage Status And Handling In cmsh
Running the device mode command drainstatus displays whether a specified node is in a Drained state or not. In a Drained state, jobs are not allowed to start running on that node.

Running the device mode command drain puts a specified node in a Drained state:

Example:

[root@bright51 ~]# cmsh
[bright51]% device
[bright51->device]% drainstatus
Node                 Queue        Status
-------------------- ------------ ----------
node001              workq
node002              workq
[bright51->device]% drain node001
Node                 Queue        Status
-------------------- ------------ ----------
node001              workq        Drained

Both the drain and drainstatus commands have the same options. The options can make the command apply not just to one node, but to a list of nodes, a group of nodes, a category of nodes, or to a chassis. Continuing the example:

[bright51->device]% drain -c slave    # for a category of nodes
Node                 Queue        Status
-------------------- ------------ ----------
node001              workq        Drained
node002              workq        Drained

The help text for the command indicates the syntax:

[root@bright51 ~]# cmsh -c "device help drain"
Usage:
drain ............................ Drain the current node
drain <node> ..................... Drain the specified node
drain <-n|--nodes <nodelist>
se was installed. In order to allow nodes to obtain a new node certificate, all nodes must be rebooted.
Please issue the following command to reboot all nodes:
pexec reboot

Rebooting Nodes After An Install
The first time a product key is used: after using a product key with the command request-license during a cluster installation, and then running install-license, a reboot is required of all nodes in order for them to pick up and install their new certificates. The install-license script has at this point already renewed the administrator certificates for use with cmsh and cmgui on the head node. The parallel execution command pexec reboot, suggested towards the end of the install-license script output, is what can be used to reboot all other nodes. Since such a command is best run by an administrator manually, pexec reboot is not scripted.

The subsequent times that a product key is used: on running the command request-license for the cluster, the administrator is prompted on whether to re-use the existing keys and settings from the existing license. If the existing keys are kept, a pexec reboot is not required. This is because these keys are X509v3 certificates issued from the head node. Any user or node certificates generated using the same certificate are therefore still valid, and so regenerating them for nodes via a reboot is not required, allowing users to continue working uninterrupted.
section covers how to use cmsh to configure monitoring. The cmsh monitoring mode is how metrics and health checks are configured from the command line, and corresponds to the configuration carried out by cmgui in section 10.4. Visualization of data, similar to how cmgui does it in section 10.3, can also be done from cmsh's command line, via its device mode. Graphs can be obtained from cmsh by piping values returned by device mode commands, such as dumpmetricdata and latestmetricdata, into graphing utilities. These techniques are not covered in this chapter.

Familiarity is assumed with handling of objects, as described in the introduction to working with objects (section 3.6.3). When using cmsh's monitoring mode, the properties of these objects (the details of the monitoring settings) are the parameters and values which are accessed and manipulated from the monitoring mode hierarchy within cmsh.

The monitoring mode of cmsh gives access to 4 modes under it:

Example:

[root@myheadnode ~]# cmsh
[myheadnode]% monitoring help | tail -5
============================== Monitoring ==============================
actions ......................... Enter threshold actions mode
healthchecks .................... Enter healthchecks mode
metrics ......................... Enter metrics mode
setup ........................... Enter monitoring configuration setup mode

These 4 modes are regarded as being the top-level monitoring-related modes.
User       Queue      Status
---------- ---------- --------
TorqueJob  90.bright51              maud       hydroq     S
[bright51->jobs(torque)]% resume 90.bright51.cm.cluster
Success
[bright51->jobs(torque)]% list    # only torque jobs
Type       Job ID                   User       Queue      Status
---------- ------------------------ ---------- ---------- --------
TorqueJob  90.bright51              maud       hydroq     R

jobs Mode In cmsh: The show Command
The show command, for a particular scheduler and job, lists details of the job. Continuing with the preceding example:

[bright51->jobs(torque)]% show 90.bright51.cm.cluster
Parameter                Value
------------------------ ------------------------------------------------
Arguments                -q hydroq /home/maud/sleeper.sh
Executable
In queue
Job ID                   90.bright51.cm.cluster
Job name                 sleeper.sh
Mail list
Mail options             a
Maximum wallclock time   02:00:00
Memory usage             0
Nodes                    node001
Notify by mail           yes
Number of processes      1
Priority                 0
Queue                    hydroq
Run directory            /home/maud
Running time             809
Status                   R
Stderr file              bright51.cm.cluster:/home/maud/sleeper.sh.e90
Stdout file              bright51.cm.cluster:/home/maud/sleeper.sh.o90
Submission time          Fri Feb 18 12:49:01 2011
Type                     TorqueJob
User                     maud

8.6.2 Job Queues Display And Handling In cmsh: jobqueue Mode
Properties of scheduler job queues can be viewed and set in jobqueue mode.

jobqueue Mode In cmsh: Top Level
If a scheduler submode is not set, then the list, qstat and listpes commands operate, as is expected, on all queues for all schedulers.

At the top level of jobqueue mode:
• list lists the
shared internal IP interface, so that the imported filesystems will continue to be accessible in the event of a failover. Shared interfaces are normally implemented as alias interfaces on the physical interfaces (e.g. eth0:0).

13.1.3 Dedicated Failover Network
In addition to the internal and external network interfaces on both head nodes, the two head nodes are usually also connected using a direct, dedicated network connection. This connection is used between the two head nodes to monitor their counterpart's availability. It is highly recommended to run a UTP cable directly from the NIC of one head node to the NIC of the other. Not using a switch means there is no disruption of the connection in the event of a switch reset.

13.1.4 Shared Storage
Almost any HA setup also involves some form of shared storage between the two head nodes. The reason for this is that state must be preserved after a failover sequence. It would be unacceptable for user home directories to be unavailable to the cluster in the event of a failover.

In the most common HA setup, the following three directories are shared:
• User home directories (i.e. /home)
• Shared tree containing applications and libraries that are made available to the slave nodes (i.e. /cm/shared)
• Node certificate store (i.e. /cm/node-installer/certificates)

The shared filesystems are only available on the active head node. For
spro/current/cm/cm-setup-pbspro -q

PBS Pro software components are then installed and initialized in /cm/shared/apps/pbspro/current, also referred to as the PBS_HOME. Users must load the pbspro environment module, which sets PBS_HOME and other environment variables, in order to use PBS Pro.

PBS Pro documentation is available at http://www.pbsworks.com/SupportDocuments.aspx.

PBS Pro Configuration
Configuration of PBS Pro is done using its qmgr command, and is covered in the PBS Pro documentation.

Running PBS Pro
PBS Pro runs the following three daemons:

1. a pbs_server daemon, typically running on the head node. This handles the acceptance of submissions, and talks to the execution daemons on the compute nodes when sending and receiving jobs. It writes logs to the /cm/shared/apps/pbspro/current/spool/server_logs/ directory on its node. Queues for this service are configured with the qmgr command.

2. a pbs_sched scheduler daemon, also typically running on the head node. It writes logs to the /cm/shared/apps/pbspro/current/spool/sched_logs/ directory.

3. a pbs_mom execution daemon, running on each compute node. This accepts, manages, and returns the results of jobs on the compute nodes. It writes logs to /cm/local/apps/pbspro/current/spool/mom_logs/ on the compute nodes.

8.5 Using cmgui With Workload Management
Viewing the workload manager services from cmgui is described in s
ssword:
New password:
Re-enter new password:

Change password for admin certificate file [y/N]: y
Enter old password:
Enter new password:
Verify new password:
Password updated

3.3.2 Certificates
While a Bright Cluster Manager cluster accepts ordinary ssh-based logins for cluster usage, the cluster management infrastructure requires public-key authentication using X509v3 certificates. Public-key authentication using X509v3 certificates means, in practice, that the person authenticating to the cluster management infrastructure must present their certificate (i.e. the public key), and in addition must have access to the private key that corresponds to the certificate.

There are two main file formats in which certificates and private keys are stored:
• PEM: the certificate and private key are stored as plain text in two separate PEM-encoded files.
• PFX (also known as PKCS12): the certificate and private key are stored in one encrypted file.

Although both formats are supported, the PFX format is preferred, since it is more convenient (a single file instead of two files) and allows the private key data to be encrypted conveniently with a password.

By default, one administrator certificate is created to interact with the cluster management infrastructure. The certificate and corresponding private key can be found on a newly installed Bright Cluster Manager cluster
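A PEM certificate and key pair can be bundled into a single password-protected PFX (PKCS#12) file with the standard openssl tool. The sketch below generates a throwaway self-signed certificate to stand in for the real administrator certificate; the file names, subject and password are illustrative assumptions, not the cluster's actual files:

```shell
#!/bin/sh
# Sketch: bundle a PEM certificate and private key into one
# password-protected PFX (PKCS#12) file with openssl.
# File names and the demo self-signed certificate are illustrative.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Throwaway key + certificate pair standing in for the real PEM files:
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
    -subj "/CN=demo-admin" -keyout admin.key -out admin.pem 2>/dev/null

# Combine them into a single encrypted PFX file:
openssl pkcs12 -export -in admin.pem -inkey admin.key \
    -out admin.pfx -passout pass:secret

ls admin.pfx
```

The resulting single file is what a tool expecting the PFX format, as described above, would consume.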
ster to run compute jobs. The User Manual is intended to get such users up to speed with the user environment and workload management system.

Updated versions of the Administrator Manual, as well as the User Manual, are always available on the cluster at /cm/shared/docs/cm.

The manuals constantly evolve to keep up with the development of the Bright Cluster Manager environment and the addition of new hardware and/or applications. The manuals also regularly incorporate customer feedback. Administrator and user input is greatly valued at Bright Computing, so any comments, suggestions or corrections will be very gratefully accepted at manuals@brightcomputing.com.

0.3 Getting Administrator-Level Support
If the Bright Cluster Manager software was obtained through a reseller or system integrator, then the first line of support is provided by the reseller or system integrator. The reseller or system integrator in turn contacts the Bright Computing support department if 2nd or 3rd level support is required.

If the Bright Cluster Manager software was purchased directly from Bright Computing, then support@brightcomputing.com can be contacted for all levels of support.

Introduction
1.1 What Is Bright Cluster Manager?
Bright Cluster Manager is a cluster management application built on top of a major Linux distribution. Bright Cluster Manager 5.1 is available on:
• Scientific Linux 5 (x86_64 only)
• RedHat Enterprise Linux Server 5 (x86_64
stname> ............. Create a new device of the given type with specified hostname
add <type> <hostname> <ip> .... Create a new device of the given type with specified hostname and boot interface with given IP

type: slavenode, masternode, ethernetswitch, ibswitch, myrinetswitch, powerdistributionunit, genericdevice, racksensor, chassis, gpuunit

Working With Objects: clone, modified, remove
Continuing on with the node object node100 that was created in the previous example, it can be cloned to node101 as follows:

Example:

[mycluster->device]% clone node100 node101
[mycluster->device*]% exit
[mycluster->device*]% modified
State    Type                     Name
-------- ------------------------ ----------------
+        Cloned                   node101
[mycluster->device*]% commit
[mycluster->device]%
[mycluster->device]% remove node100
[mycluster->device*]% commit
[mycluster->device]%

The modified command is used to check which objects have uncommitted changes, and the new object node101, which is seen to be modified, is saved with a commit. The device node100 is then removed by using the remove command. A commit executes the removal.

The modified command corresponds roughly to the functionality of the List of Changes menu option, under the View menu of cmgui's main menu bar.

The entry in the State column in the output of the modified command in the above example indicates the object is a newly added
storage (storage nodes)

3.1.4 Node Groups
A node group consists of nodes that have been grouped together for convenience. The group can consist of any mix of all kinds of nodes, irrespective of whether they are head nodes or ordinary nodes, and irrespective of what (if any) category they are in. A node may be in 0 or more node groups at one time; i.e. a node may belong to many node groups.

Node groups are used mainly for carrying out operations on an entire group of nodes at a time. Since the nodes inside a node group do not necessarily share the same configuration, configuration changes cannot be carried out using node groups.

Example:

Node Group     Members
-------------- ------------------------------------------
broken         node087, node783, node917
headnodes      mycluster-m1, mycluster-m2
rack5          node212, node254
top            node084, node126, node168, node210

3.1.5 Roles
A role is a task that can be performed by a node. By assigning a certain role to a node, an administrator activates the functionality that the role represents on this node. For example, a node can be turned into a provisioning node, or a login node, by assigning the corresponding roles to the node.

Roles can be assigned to individual nodes or to node categories. When a role has been assigned to a node category, it is implicitly assigned to all nodes inside of the category.

Some roles allow per-node parameters to be set that influence t
/etc/sysconfig/clock                    CMDaemon         Entire file
/etc/postfix/main.cf                    CMDaemon         Section
/etc/postfix/generic                    CMDaemon         Section
/etc/aliases                            CMDaemon         Section
/etc/ntp.conf                           CMDaemon         Entire file
/etc/ntp/step-tickers                   CMDaemon         Entire file   RedHat only
/etc/named.conf                         CMDaemon         Entire file
/var/named                              CMDaemon         Entire file   RedHat only
/var/lib/named                          CMDaemon         Entire file   SuSE only

Files generated automatically in software images

File                                    Generated By     Method        Comment
/etc/localtime                          CMDaemon         Entire file
/etc/hosts                              CMDaemon         Section
/etc/sysconfig/ipmicfg                  CMDaemon         Entire file
/etc/sysconfig/clock                    CMDaemon         Entire file
/etc/sysconfig/kernel                   CMDaemon         Section       SuSE only
/etc/sysconfig/network/config           CMDaemon         Section       SuSE only
/etc/sysconfig/network/routes           CMDaemon         Entire file   SuSE only
/boot/vmlinuz                           CMDaemon         Symlink
/boot/initrd                            CMDaemon         Symlink
/boot/initrd                            CMDaemon         Entire file
/etc/modprobe.conf                      CMDaemon         Entire file
/etc/postfix/main.cf                    CMDaemon         Section
/etc/postfix/generic                    CMDaemon         Section
/etc/aliases                            CMDaemon         Section

Files generated automatically on nodes

File                                    Generated By     Method        Comment
/etc/hosts                              Node-installer   Section
/etc/exports                            CMDaemon         Section
/etc/fstab                              Node-installer   Section
/etc/sysconfig/network                  Node-installer   Entire file
/etc/sysconfig/network/ifcfg-*          Node-installer   Entire file   SuSE only
/etc/sysconfig/network-scripts/ifcfg-*  Node-installer   Entire file   RedHat only
/etc/ntp.conf                           Node-insta
> ..... Drain all nodes in the list
drain <-g|--group <group>> ........ Drain all nodes in the group
drain <-c|--category <category>> .. Drain all nodes in the category
drain <-h|--chassis <chassis>> .... Drain all nodes in the chassis

nodelist: e.g. node001..node015,node20..node028,node030

8.7 Examples Of Workload Management Assignment
8.7.1 Setting Up A New Category And A New Queue For It
Suppose a new node with a GPU is added to a cluster that originally has no nodes with GPUs. This merits a new category, GPUnodes, so that administrators can configure more new GPU nodes, such as this one, efficiently. It also merits a new queue, gpuq, so that users are aware that they can submit GPU-optimized jobs to the GPU queue.

To create a new queue, the Workload Management item is selected, and the Queues tab selected. The Add button is used to associate a newly created queue with a scheduler and add it to the workload manager. The modification is then saved (Figure 8.10).

[Figure 8.10: Adding A New Queue Via cmgui]

A useful way to create a new category is to simply clone the old slave category over to a new category, and then change the parameters
t assert name="modelCheck" args="WD800AAJS">
      <![CDATA[
        #!/bin/bash
        if grep -q $1 /sys/block/$ASSERT_DEV/device/model; then
          exit 0
        else
          exit 1
        fi
      ]]>
    </assert>
    <partition id="a1">
      <size>max</size>
      <type>linux</type>
      <filesystem>ext3</filesystem>
      <mountPoint>/</mountPoint>
      <mountOptions>defaults,noatime,nodiratime</mountOptions>
    </partition>
  </device>
  <device>
    <blockdev>/dev/sdb</blockdev>
    <vendor>BigRaid</vendor>
    <requiredSize>2T</requiredSize>
    <partition id="b1">
      <size>max</size>
      <type>linux</type>
      <filesystem>ext3</filesystem>
      <mountPoint>/data</mountPoint>
      <mountOptions>defaults,noatime,nodiratime</mountOptions>
    </partition>
  </device>
</diskSetup>

D.5 Example: Software RAID
This example shows a simple software RAID setup. The level tag specifies what type of RAID is used. The following RAID levels are supported: 0 (striping without parity), 1 (mirroring), 4 (striping with dedicated parity drive), 5 (striping with distributed parity) and 6 (striping with distributed double parity).

The member tags must refer to an id attribute of a partition tag, or an id attribute of another raid tag. The latter can be used to create, for example, R
t detection (Figure 6.8), or to drop into a sub-dialog to manually select a node configuration (Figure 6.7). By default, port detection is retried after a timeout.

[Figure 6.8: Scenarios: Unknown Node, Ethernet Port Detected]

3. The node is new, and an Ethernet switch port is not detected. The node installer then displays a dialog that allows the user to either retry Ethernet switch port detection (Figure 6.9), or to drop into a sub-dialog to manually select a node configuration (Figure 6.7). By default, port detection is retried after a timeout.

[Figure 6.9: Scenarios: Unknown Node, No Ethernet Port Detected]

4. The node is known, and an Ethernet switch port is detected. The configuration associated with the port is the same as the configuration associated with the node's MAC address. The node installer then displays the configuration as a suggestion, along with a confirmation dialog (Figure 6.6). The suggestion can be interrupted, and other node configurations can be selected manually instead, through a sub-dialog (Figure 6.7). By default, in the main dialog, the original suggestion is accepted after a timeout.

5. The node is known, and an Ethernet switch port is detected. However, the configuration associated with the port is not the same as the configuration associated with the node's MAC address. This is
…infrastructure. Both are available when running cmgui on the cluster. However, before making the initial connection from a desktop computer running cmgui, a PFX file containing both the certificate and private key must be copied from the cluster and stored in a secure location on the local filesystem.

Example:

user@desktop:~$ mkdir cmgui-keys
user@desktop:~$ chmod 700 cmgui-keys
user@desktop:~$ scp root@mycluster:/root/.cm/cmgui/admin.pfx cmgui-keys/mycluster-admin.pfx

Figure 3.1: Cluster Management GUI welcome screen

When cmgui is started for the first time, the welcome screen (figure 3.1) is displayed. To configure cmgui for connections to a new Bright Cluster Manager cluster, the cluster is added to cmgui by clicking the + button in the welcome screen. Figure 3.2 shows the dialog window in which the connection parameters can be entered.

Figure 3.2: Edit Cluster dialog window

The host can be a name or an IP address. If the port on the host is not specified, then port 8081 is added automatically. The certificate location entry is where the administrator certificate (the admin.pfx file) is located. The password is the password to the administrator certificate. Section 3.3 has details on the admin.pfx file, as well as on how to change the password used in the dialog, with the passwdpfx or cm-change-passwd utilities.

After the cluster is added, the screen displays the connection parameters for the cluster (figure 3.3).

Figure 3.3: Connecting to a cluster

Clicking on the Connect button establishes a connection to the cluster, and cmgui then displays a tabbed pane overview screen of the cluster (figure 3.4).
…is running on, and has decided what its node configuration is. It now gets on with setting up the interfaces required for the installer, with IP addressing, while taking care of matters that come up on the way:

avoiding duplicate IP addresses: The node installer brings up all the network interfaces configured for the node. Before starting each interface, the node installer first checks if the IP address that is about to be used is not already in use by another device. If it is, then a warning and retry dialog is displayed until the IP address conflict is resolved.

using BOOTIF to specify the boot interface: BOOTIF is a special name for one of the possible interfaces. The node installer automatically translates BOOTIF into the name of the device, such as eth0 or eth1, used for network booting. This is useful for a machine with multiple network interfaces, where it can be unclear whether to specify, for example, eth0 or eth1 for the interface that was used for booting. Using the name BOOTIF instead means that the underlying device (eth0 or eth1 in this example) does not need to be specified in the first place.

halting on missing kernel modules for the interface: For some interface types, like VLAN and channel bonding, the node installer halts if the required kernel modules are not loaded, or are loaded with the wrong module options. In this case, the kernel modules configuration for the relevant software image should be reviewed. Recreating…
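The BOOTIF translation described above relies on the MAC address that the PXE boot loader appends to the kernel command line. The following is a hedged sketch, not the node installer's actual code, of how a BOOTIF-style kernel parameter can be reduced to a MAC address, which could then be matched against /sys/class/net/*/address to recover the device name:

```shell
# Stand-in for the contents of /proc/cmdline on a PXE-booted node.
# The leading "01-" in the BOOTIF value is the ARP hardware type for Ethernet.
cmdline='BOOTIF=01-aa-bb-cc-dd-ee-ff console=tty0'

# Extract the BOOTIF value, strip the hardware-type prefix, and
# convert dashes to the usual colon-separated MAC notation.
mac=$(echo "$cmdline" | tr ' ' '\n' | sed -n 's/^BOOTIF=..-//p' | tr '-' ':')
echo "$mac"    # aa:bb:cc:dd:ee:ff
```

On a real node, one would read /proc/cmdline, then compare $mac with each /sys/class/net/*/address file to find the interface name (e.g. eth0).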
…<mountOptions>defaults,noatime,nodiratime</mountOptions>
</partition>
</device>
<device>
  <blockdev>/dev/sdb</blockdev>
  <vendor>BigRaid</vendor>
  <requiredSize>2T</requiredSize>
  <partition id="b1">
    <size>max</size>
    <type>linux</type>
    <filesystem>ext3</filesystem>
    <mountPoint>/data</mountPoint>
    <mountOptions>defaults,noatime,nodiratime</mountOptions>
  </partition>
</device>
</diskSetup>

D.4 Example: Using Custom Assertions

The following example shows the use of the <assert> tag, which can be added to a device definition. The <assert> tag is similar to the <vendor> and <size> tags described before. It can be used to define custom assertions. The assertions can be implemented using any scripting language. The script has access to the environment variables ASSERT_DEV (i.e. sda) and ASSERT_NODE (i.e. /dev/sda). Each assert needs to be assigned an arbitrary name, and can be passed custom parameters. The exit code should be non-zero when the assert should trigger the node installer to halt.

<?xml version="1.0" encoding="ISO-8859-1"?>
<diskSetup xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
           xsi:noNamespaceSchemaLocation="schema.xsd">
  <device>
    <blockdev>/dev/sda</blockdev>
    …
…metadata target must be created. To create the metadata target, a free disk partition or logical volume is used. The disk device can also be an external storage device and/or a redundant storage device. The metadata server also acts as a management server.

To format a metadata target, mkfs.lustre is used. For example, to format /dev/sdb and set the Lustre filesystem name to lustre00:

Example:

[root@mds001 ~]# mkfs.lustre --fsname=lustre00 --mdt --mgs /dev/sdb

The filesystem is mounted and the entry added to /etc/fstab:

Example:

[root@mds001 ~]# mkdir /mnt/mdt
[root@mds001 ~]# mount -t lustre /dev/sdb /mnt/mdt
[root@mds001 ~]# echo "/dev/sdb /mnt/mdt lustre rw 0 0" >> /etc/fstab

Creating The Lustre Object Storage Target

On the object storage server, one or multiple object storage target(s) can be created. To create the object storage targets, free disks, partitions, or logical volumes are used. The disk devices can also be external storage devices and/or redundant storage devices.

To format an object storage target, mkfs.lustre is used. For example, to format /dev/sdb, set the management node to 10.141.16.1, the filesystem name to lustre00, and the network type to TCP/IP:

Example:

[root@oss001 ~]# mkfs.lustre --fsname=lustre00 --ost --mgsnode=10.141.16.1@tcp0 /dev/sdb

The filesystem is mounted and the entry added to /etc/fstab:

Example:

[root@oss001 ~]# mkdir /mnt/ost01
[root@oss001 ~]# mount -t lustre /dev/sdb /mnt/ost01
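For completeness, a Lustre client would later mount the finished filesystem via the management node's NID. The following is a hedged sketch based on standard Lustre client usage, not taken from this manual; the client-side mount point and the exact NID syntax are assumptions:

```
[root@node001 ~]# mkdir /mnt/lustre00
[root@node001 ~]# mount -t lustre 10.141.16.1@tcp0:/lustre00 /mnt/lustre00
```

The mgsnode address and filesystem name must match the values used with mkfs.lustre above.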
…tab for the selected device category. These properties are the configuration of the sampling parameters themselves (for example, frequency and length of logs), but also the configuration of related properties, such as thresholds, consolidation, actions launched when a threshold is crossed, and actions launched when a metric state is flapping.

The Metric Configuration tab is initially a blank tab, until the device category is selected by using the Metric Configuration selection box. The selection box selects the device category from a list of built-in categories and user-defined node categories (node categories are introduced in section 3.1.3). On selection, the metrics of the selected device category are listed in the Metric Configuration tab. Properties of the metrics related to sampling are only available for configuration and manipulation after the metrics list displays. Handling metrics in this manner, via groups of devices, is slightly awkward for just a few machines, but for larger clusters it keeps administration scalable, and thus manageable.

Figure 10.16 shows an example of the Metric Configuration tab after All master nodes is chosen as the device category. This corresponds to the basic example of section 10.1, where All master nodes was the device category chosen, because it was the CPUUser metric on a master node that was to be monitored. Examples of other device categories…
Node Installer Cannot Start IPMI Interface

In some cases the node installer is not able to configure a node's IPMI interface, and displays an error message similar to figure 6.26. Usually the issue can be solved by adding the correct IPMI kernel modules to the software image's kernel module configuration. However, in some cases the node installer is still not able to configure the IPMI interface. If this is the case, the IPMI card probably does not support one of the commands the node installer uses to set specific settings. To solve this issue, setting up IPMI interfaces can be disabled globally by setting the setupIpmi field in the node installer configuration file /cm/node-installer/scripts/node-installer.conf to false. Doing this disables configuration of all IPMI interfaces by the node installer. A custom finalize script can then be used to run the required commands instead.

Figure 6.26: No IPMI Interface

User Management

Unix users and groups for the cluster are presented to the administrator in a single-system paradigm. That is, if the administrator manages them with Bright Cluster Manager, then the changes are automatically shared across the cluster via the LDAP service.

This chapter describes how to add, remove, and edit users and groups using Bright Cluster Manager.

7.1 Managing Users And Groups With cm…
(The figure shows the cmgui Monitoring Configuration tree, with the CPUUser metric selected for the All Master Nodes category, and the New Threshold dialog, in which a threshold name, a bound of 50, a bound type (upper bound: value > bound; lower bound: value < bound), a severity, and an action to launch during threshold crossing are set.)

Figure 10.3: cmgui Monitoring Configuration: Setting A Threshold

Clicking on Ok exits the New Threshold dialog, clicking on Done exits the Thresholds dialog, and clicking on Save saves the threshold setting associated with CPUUser on the head node.

The Result

In the above, an action wa…
…intel-cluster-checker intel-cluster-runtime
cluster-check --packages /home/cmsupport/intel-cluster-ready/recipe-root.xml
mv master-20090522-173125.list /home/cmsupport/intel-cluster-ready/head.package.list
mv node001-20090522-173125.list /home/cmsupport/intel-cluster-ready/node.package.list

12.4.4 Running Intel Cluster Checker

Regular User Run

The cmsupport account is used to perform the regular user run. The following commands start the cluster checker:

su - cmsupport
module initadd intel-cluster-checker intel-cluster-runtime
module load intel-cluster-checker intel-cluster-runtime
cluster-check --certification 1.1 /home/cmsupport/intel-cluster-ready/recipe-user-ib.xml

The cluster checker produces two output files, one .xml and one .out, which include time stamps in the filenames. In the event of failing tests, the output files should be consulted for details as to why the test failed. When debugging and re-running tests, the include_only test parameter can be passed to cluster-check, to execute just the specified test and the tests on which it depends.

Privileged User Run

The privileged user run should be started as the root user. The following commands start the cluster checker:

module load shared intel-cluster-checker intel-cluster-runtime
cluster-check --certification 1.1 /home/cmsupport/intel-cluster-ready/recipe-root-ib.xml

In a heterogeneous…
…tem:

• In cmsh, using the category or device modes. The get command is used for viewing the script, and the set command to start up the default text editor to edit the script. The output is truncated in the two following examples at the point where the editor starts up:

Example:

[root@bright51 ~]# cmsh
[bright51]% category use slave
[bright51->category[slave]]% show | grep script
Parameter          Value
------------------ --------------
Finalize script    <1367 bytes>
Initialize script  <0 bytes>
[bright51->category[slave]]% set initializescript

Example:

[bright51]% device use node001
[bright51->device[node001]]% set finalizescript

The imageupdate_initialize and imageupdate_finalize scripts are similar scripts, but run, as their names imply, when the imageupdate command is run, and not during node provisioning. They are discussed further in section 6.5.2.

For the initialize and finalize scripts, node-specific customizations can be made from a script using environment variables. The following script does not actually do anything useful, but does show the available variables:

#!/bin/bash
echo HOSTNAME $HOSTNAME
echo HWTAG $HWTAG
echo MAC $MAC
echo PARTITION $PARTITION
echo RACKINDEX $RACKINDEX
echo DEVICEPOSITION $DEVICEPOSITION
echo DEVICEHEIGHT $DEVICEHEIGHT
echo INSTALLMODE $INSTALLMODE
echo CATEGORY $CATEGORY
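As a hedged illustration of how such variables might be used in practice, the sketch below simulates the environment the node installer would provide and branches on the node's category. The variable values and the generated file are hypothetical, for illustration only:

```shell
# Simulate the environment the node installer would export for the script.
export HOSTNAME=node001 CATEGORY=slave RACKINDEX=1

# A finalize-style customization: emit a config fragment per category.
case "$CATEGORY" in
  slave)  echo "kernel.sysrq = 0" > ./sysctl-extra.conf ;;
  *)      : ;;   # other categories: do nothing in this sketch
esac
cat ./sysctl-extra.conf    # kernel.sysrq = 0
```

On a real node, the script would write under the node's mounted filesystem rather than the current directory.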
…items in the tree, for example: Node Categories, Slave Nodes, and Networks. Selecting a resource folder in the tree displays a list of resource items inside the folder. These are displayed in the resource tree and in the tabbed pane.

Resource items in the tabbed pane can be selected, and operations carried out on them by clicking on the buttons at the bottom of the tabbed pane. For example, for the Slave Nodes resource, one or more nodes can be selected, and the Open, Add, Clone, and Remove buttons can be clicked to operate on the selection (figure 3.7).

Figure 3.7: Nodes Overview

3.6 Cluster Management Shell

This section introduces the basics of the cluster management shell, cmsh. This is the command-line interface to cluster management in Bright Cluster Manager. Since cmsh and cmgui give access to the same cluster management functionality, an administrator need not become familiar with both interfaces. Admi…
…cluster management software to communicate with the switch or PDU, SNMP must be enabled, and the SNMP community strings should be configured correctly. By default, the SNMP community strings for switches and PDUs are set to public and private, for read and write access respectively. If different SNMP community strings have been configured in the switch or PDU, the readstring and writestring properties of the corresponding switch device should be changed.

Example:

[mycluster]% device use switch01
[mycluster->device[switch01]]% get readstring
public
[mycluster->device[switch01]]% get writestring
private
[mycluster->device[switch01]]% set readstring public2
[mycluster->device[switch01]]% set writestring private2
[mycluster->device[switch01]]% commit

4.5.3 Uplink Ports

Uplink ports are switch ports that are connected to other switches. CMDaemon must be told about any switch ports that are uplink ports, or the traffic passing through an uplink port will lead to mistakes in what CMDaemon knows about port and MAC correspondence. Uplink ports are thus ports that CMDaemon is told to ignore.

To inform CMDaemon about what ports are uplink ports, cmgui or cmsh are used:

• In cmgui, the switch is selected from the Switches folder, and the Settings tabbed pane is opened (figure 4.8). The port number corresponding to the uplink port number is filled in the blank…
…Interface, NetworkAliasInterface, NetworkBondInterface

• CMD_IPMIUSERNAME: username for the IPMI device at this node, if available

• CMD_IPMIPASSWORD: password for the IPMI device at this node, if available

To parse the above information to get the IPMI IP address of the node for which this script samples, one could use, in Perl:

my $ip;
my $interfaces = $ENV{"CMD_INTERFACES"};
foreach my $interface (split(",", $interfaces)) {
  if ($ENV{"CMD_INTERFACE_${interface}_TYPE"} eq "NetworkIpmiInterface") {
    $ip = $ENV{"CMD_INTERFACE_${interface}_IP"};
    last;
  }
}
# $ip holds the ipmi ip

I.6 Metric Collections Examples

Bright Cluster Manager has several scripts in the /cm/local/apps/cmd/scripts/metrics directory. Among them are the metric collections scripts testmetriccollection and sample_responsiveness. A glance through them while reading this appendix may be helpful.

Changing The Network Parameters Of The Head Node

J.1 Introduction

After a cluster physically arrives at its site, the first configuration task that an administrator usually faces is to change the network settings to suit the network at the site. How to configure network interfaces is detailed in section 4.2.1 of the Bright Cluster Manager Administrator Manual, and is easy to do. However, there is some reliance on having understood the material leading up t…
…ternal network. Figure 1.1 illustrates a typical cluster network setup.

Figure 1.1: Cluster network

Most clusters are equipped with one or more power distribution units. These units supply power to all compute nodes, and are also connected to the internal cluster network. The head node in a cluster can use the power control units to switch compute nodes on or off. From the head node, it is straightforward to power on/off a large number of compute nodes with a single command.

1.3 Bright Cluster Manager Administrator And User Environment

Bright Cluster Manager contains several tools and applications to facilitate the administration and monitoring of a cluster. In addition, Bright Cluster Manager aims to provide users with an optimal environment for developing and running applications that require extensive computational resources.

1.4 Organization of This Manual

The following chapters of this manual describe all aspects of Bright Cluster Manager from the perspective of a cluster administrator.

Chapter 2 gives step-by-step instructions for installing Bright Cluster Manager on the head node of a cluster. Readers with a cluster that was shipped with Bright Cluster Manager pre-installed may safely skip this chapter.

Chapter 3 introduces the main concepts and tools that play a central role in Bright Cluster Manager, laying down groundwork for the remain…
…ters, and awaits user input to start the install. Otherwise, if normal installation mode was selected earlier, then the Kernel Modules configuration screen is displayed, described next.

Figure 2.2: Bright Cluster Manager Software License

Figure 2.3: Base Distribution End User License Agreement

Kernel Modules Configuration

The Kernel Modules screen (figure 2.4) shows the kernel modules recommended for loading, based on hardware auto-detection. Clicking the + button opens an input box for entering the module name and optional module parameters. Clicking the Add button in the input box adds the kernel module. The − button removes a selected module from the list, and the arrow buttons move a kernel module up or down in the list. Kernel module loading order decides the exact name assigned to a device (e.g. sda, sdb, eth0, eth1). After optionally adding or removing kernel modules, clicking Continue leads to the Hardware…
…th values, to avoid truncating the full path of the commands in the display. The above example shows a truncated list of health checks that can be set for sampling on a newly installed system. The details of what these health checks do are covered in appendix H.2.1.

The show command of cmsh displays the parameters and values of a specified health check:

Example:

[myheadnode->monitoring->healthchecks]% show deviceisup
Parameter               Value
----------------------- ----------------------------------------------
Class of healthcheck    internal
Command                 <built-in>
Description             Returns PASS when device is up, closed or insta...
Disabled                no
Extended environment    no
Name                    DeviceIsUp
Only when idle          no
Parameter permissions   disallowed
Sampling method         samplingonmaster
State flapping count    7
Timeout                 5
Valid for               slave,master,pdu,ethernet,myrinet,ib,racksensor...
[myheadnode->monitoring->healthchecks]%

The meanings of the parameters are covered in appendix H.2.2.

As detailed in section 10.7.1, tab-completion suggestions for the show command suggest arguments corresponding to names of objects that can be used in this mode. For show in healthchecks mode, tab-completion suggestions give the following as possible health check objects:

Example:

[myheadnode->monitoring->healthchecks]% show
cmsh          failover      mounts        ssh2node
deviceisup    cpucheck      mysql         testhealthcheck
exports       ldap          portchecker
failedprejob  manageds…
the following possible device categories:

Example:

[myheadnode->monitoring->setup]% metricconf
chassis          ibswitch         racksensor
ethernetswitch   masternode       slave
genericdevice    myrinetswitch
gpuunit          powerdistributionunit

A category can be chosen with the use command, and show will show the properties of the category. With a category selected, the metricconf or healthconf submodes can then be invoked.

Example:

[myheadnode->monitoring->setup]% use masternode
[myheadnode->monitoring->setup[MasterNode]]% show
Parameter                 Value
------------------------- --------------------
Category                  MasterNode
Health configuration      <9 in submode>
Metric configuration      <88 in submode>
Normal pickup interval    180
Scrutiny pickup interval  60
[myheadnode->monitoring->setup[MasterNode]]% metricconf
[myheadnode->monitoring->setup[MasterNode]->metricconf]%

Dropping into a submode, in the example given the metricconf submode, could also have been done directly in one command: metricconf masternode. The synopsis of the command in the example is actually:

monitoring setup [metricconf] masternode

where the optional parts of the command are invoked depending upon the context indicated by the prompt. The example below clarifies this (some prompt text elided for display purposes):

Example:

[…->monitoring->setup[MasterNode]->metricconf]% exit; exit; exit; exit
[myheadnode]% monitoring setup metricconf masternode
[…->monitoring->setup[MasterNode]->metri…
the node's category, and then an initialize script, if it exists, from the node's configuration. The node installer sets several environment variables which can be used by the initialize script. Appendix E contains an example script documenting these variables.

Related to the initialize script are:

• The finalize script (section 6.3.11). This may run after node provisioning is done, but just before the init process on the node runs.

• The imageupdate_initialize and imageupdate_finalize scripts, which may run when the imageupdate command runs (section 6.5.2).

6.3.6 Checking Partitions, Mounting File Systems

In the previous section, the node installer determines the install mode value, along with when to apply it to a node. The install mode value defaults mostly to AUTO. If AUTO applies to the current node, it means the node installer then checks the partitions of the local drive and its file systems, and recreates them in case of errors. Partitions are checked by comparing the partition layout of the local drive(s) against the drive layout as configured in the node's category configuration and the node configuration.

After the node installer has checked the drive(s), and, if required, recreated the layout, it mounts all file systems to allow the drive contents to be synchronized with the contents of the software image.

If install mode values of FULL or MAIN apply to t…
the node prefix may need to be changed according to the actual settings of the cluster.

After modifying the image, if there are provisioning nodes, the updateprovisioners command (section 6.1.4) should be run. The nodes can then simply be rebooted to pick up the new image, or alternatively, to avoid rebooting, the imageupdate command (section 6.5.2) can be run to pick up the new image.

To make the new setting take effect, the following command is run using a parallel shell:

/etc/init.d/sshd restart

After this change is implemented, only the root user can log in to a node from the head node. Users may, however, still log in from any node to any other node, as this is needed for a number of MPI implementations to function properly. Administrators may choose to disable interactive jobs in the workload management system as a measure to prevent users from starting jobs on other nodes. The workload management system documentation has more on configuring this.

11.3 Getting Help With Bugs And Other Issues

Bright Cluster Manager is constantly undergoing development. While the result is a robust and well-designed cluster manager, it is possible that the administrator may run into a bug or other issue with it that requires help from Bright Computing or Bright Computing's resellers. This section describes how to report such problems and get help for them.

11.3.1 Getting Support From The Reseller

If the Bright Cluster Manager software was obtained through…
the roles submode.

Example:

[root@bright51 ~]# cmsh
[bright51]% device
[bright51->device]% use node001
[bright51->device[node001]]% roles
[bright51->device[node001]->roles]% assign torqueclient
[bright51->device[node001]->roles[torqueclient]]% commit
[bright51->device[node001]->roles[torqueclient]]%

After workload manager roles are assigned or unassigned on the head and compute nodes, the associated workload manager services automatically start up or stop as appropriate.

Once a setting has been assigned for the workload manager within the roles submode (whether within a main mode of category or devices), the workload manager settings can be handled with the usual object commands introduced in section 3.6.3.

Example:

[bright51->category[slave]->roles[torqueclient]]% show
Parameter     Value
------------- -----------------
All Queues    yes
Name          torqueclient
Queues        shortq longq
Slots         4
Type          TorqueClientRole
[bright51->category[slave]->roles[torqueclient]]% set slots 5
[bright51->category[slave]->roles[torqueclient]]% commit
[bright51->category[slave]->roles[torqueclient]]%

8.3.3 Monitoring The Workload Manager Services

The workload manager services are monitored. Restart attempts are made if the services stop, unless the role for that workload manager service is unassigned. As mentioned previously, role unassignment is how the workload manager service should be disabled.

The daemon service states can…
…time measurement data values in the graph are displayed on the graph toolbar by hovering the mouse cursor over the graph.

3. The graph view adjustment buttons are:

• play/pause: By default, the graph is refreshed with new data every 2 minutes. This is disabled and resumed by clicking on the pause/play button on the graph toolbar.

• zoom out/zoom in: Clicking on one of the magnifying glasses zooms in or zooms out on the graph in time. This way, data values can be shown even from many months ago. Zooming in with mouse gestures is also possible, and is discussed in section 10.3.3.

• broadcast: A time-scale synchronizer. Toggling this button to a pressed state for one of the graphs means that scale changes carried out via magnifying-glass zooms (see bullet point above), or via mouse gestures (section 10.3.3), are done over all the other graph display panes too, so that their x-ranges match. This is useful for large numbers of nodes.

Figure 10.8: Graph Display Pane Features (close widget, zoom in/out, settings, time/measurement value pair)

• settings: Clicking on this button opens a dialog window to modify certain aspects of the graph. The settings dialog is discussed in section 10.3.4.

4. Any number of graph display panes are laid out by using the Grid menu option of the main Monitoring Pane (figure 10.6).

5. Multip…
…tion:

stop         stop the cluster management daemon
start        start the cluster management daemon
restart      restart the cluster management daemon
status       report whether cluster management daemon is running
full status  report detailed statistics about cluster management daemon
upgrade      update database schema after version upgrade (expert only)
debugon      enable debug logging (expert only)
debugoff     disable debug logging (expert only)

Example:

To restart the cluster management daemon on the head node of a cluster:

[root@mycluster ~]# /etc/init.d/cmd restart
Waiting for CMDaemon to terminate...
Stopping CMDaemon: [ OK ]
Waiting for CMDaemon to start...
Starting CMDaemon: [ OK ]
[root@mycluster ~]#

3.7.2 Configuring The Cluster Management Daemon

Some cluster configuration changes can be done by modifying the cluster management daemon configuration file. For the head node, this is located at /cm/local/apps/cmd/etc/cmd.conf. For ordinary nodes, it is located inside of the software image that the node uses. Appendix C describes all recognized configuration file directives and how they can be used. Normally there is no need to modify the default settings.

After modifying the configuration file, the cluster management daemon must be restarted to activate the changes.

3.7.3 Configuration File Generation

As part of its tasks, the cluster management daemon writes o…
…tion and high availability configuration are done has implications on what configuration files need to be adjusted:

1. For LDAP replication configuration done after high availability configuration: adjusting the new suffix in /cm/local/apps/openldap/etc/slapd.conf, and in /etc/ldap.conf on the passive node, to the local cluster suffix, suffices as a configuration.

2. For high availability configuration done after LDAP replication configuration: the initial LDAP configurations and database are propagated to the passive node. To set replication to the passive node from the active node, and not to the passive node from an external server, the provider option in the syncrepl directive on the passive node must be changed to point to the active node, and the suffix in /cm/local/apps/openldap/etc/slapd.conf on the passive node must be set identical to the head node.

The high availability replication event occurs once only for configuration and database files in Bright Cluster Manager's high availability system. Configuration changes made on the passive node after the event are therefore persistent.

7.4 Using Kerberos Authentication

The default Bright Cluster Manager 5.1 setup uses LDAP for storing user information and for authentication. This section describes how LDAP can be configured to use a Kerberos V5 authentication back end, assuming a Kerberos server has…
…tion credentials is currently only possible through cmsh. It is possible to change the authentication credentials cluster-wide, or by category. Category settings override cluster-wide settings. The relevant properties are:

Property        Description
--------------- ---------------------------------------------------
IPMI User ID    User type. Normally set to 2 for administrator access.
IPMI User Name  Username
IPMI Password   Password for specified user name

The cluster management infrastructure stores the configured IPMI username and password, not just to configure the IPMI interface from the node installer. The information is also used to authenticate to the IPMI interface once it has been brought up, in order to perform IPMI management operations (e.g. power cycling nodes and collecting hardware metrics).

Example: Changing the IPMI username and password for the entire cluster:

[mycluster]% partition use base
[mycluster->partition[base]]% set ipmiusername ipmiadmin
[mycluster->partition[base]]% set ipmipassword
enter new password: ******
retype new password: ******
[mycluster->partition[base]]% commit
[mycluster->partition[base]]%

4.4 Configuring InfiniBand Interfaces

On clusters with an InfiniBand interconnect, the InfiniBand Host Channel Adapter (HCA) in each node must be configured before it can be used.

4.4.1 Installing Software Packages

On a standard Bright Cluster Manager cluster, the OFED…
...to an Ethernet switch, and if so, to which port. Setting up Ethernet switches for port detection is covered in section 4.5.

If a port is detected for the node, the node-installer queries CMDaemon for a node configuration associated with the detected Ethernet switch port. If a port is not detected for the node, then either the hardware involved with port detection needs checking, or a node configuration must be selected manually.

There are thus several scenarios:

1. The node is new and an Ethernet switch port is detected. A previous configuration associated with the port is found. The node-installer suggests to the administrator that the new node use this configuration, and displays the configuration along with a confirmation dialog (figure 6.6). This suggestion can be interrupted, and other node configurations can be selected manually instead, through a sub-dialog (figure 6.7). By default (in the main dialog), the original suggestion is accepted after a timeout.

Figure 6.6: Scenarios: Configuration Found, Confirm Node Configuration

Figure 6.7: Scenarios: Node Selection Sub-Dialog

2. The node is new and an Ethernet switch port is detected. A previous configuration associated with the port is not found. The node-installer then displays a dialog that allows the administrator to either retry Ethernet switch por...
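The decision flow above can be summarized in a small sketch. The scenario labels are mine, for illustration, and do not reflect the node-installer's actual internals:

```python
def identification_scenario(port_detected, config_for_port):
    """Mirror the node identification logic described in the text:
    - port detected, stored config found -> suggest it (with timeout)
    - port detected, no config found     -> retry detection or select manually
    - no port detected                   -> check hardware or select manually
    """
    if not port_detected:
        return "check-hardware-or-manual-selection"
    if config_for_port is not None:
        return "suggest-config-with-confirmation-timeout"
    return "retry-detection-or-manual-selection"
```
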
...to be able to measure power usage over time.

This chapter describes the Bright Cluster Manager power management features. In section 5.1 the configuration of the methods used for power operations is described. Section 5.2 then describes the way the power operation commands themselves are used to allow the administrator to turn power on, turn power off, reset the power, and retrieve the power status, and explains how these can be applied to devices in various ways. Section 5.3 briefly covers monitoring power.

5.1 Configuring Power Parameters

Several methods exist to control power to devices:

• Power Distribution Unit (PDU) based power control
• IPMI-based power control (for node devices only)
• Custom power control
• HP iLO-based power control (for node devices only)

5.1.1 PDU-Based Power Control

For PDU-based power control, the power supply of a device is plugged into a port on a PDU. The device can be a node, but also anything else with a power supply, such as a switch. The device can then be turned on or off by changing the state of the PDU port.

To use PDU-based power control, the PDU itself must be a device in the cluster, and must be reachable over the network. The Settings tab of each device object plugged into the PDU is then used to configure the PDU ports that will control it. Figure 5.1 shows the Settings tab for a head node. Each device plugged into the PDU can have PDU ports added and removed with the + and - buttons in thei...
...trators. The modules environment provides facilities to control aspects of a user's interactive sessions, and also the environment used by compute jobs.

Section 3.3 introduces how authentication to the cluster management infrastructure works and how it is used. Section 3.4 and section 3.6 introduce the cluster management GUI (cmgui) and cluster management shell (cmsh) respectively. These are the primary applications that interact with the cluster through its management infrastructure. Section 3.7 describes the basics of the cluster management daemon, CMDaemon, running on all nodes of the cluster.

3.1 Concepts

In this section some concepts central to cluster management with Bright Cluster Manager are introduced.

3.1.1 Devices

A device in the Bright Cluster Manager cluster management infrastructure represents a physical hardware component that is part of a cluster. A device can be any of the following types:

• Head Node
• Node
• Graphics Processing Unit
• Ethernet Switch
• InfiniBand Switch
• Myrinet Switch
• Power Distribution Unit
• Rack Sensor Kit
• Generic Device

A device can have a number of properties (e.g. rack position, hostname, switch port) which can be set in order to configure the device. Using the cluster management infrastructure, operations (e.g. power on) may be performed on a device. The property changes and operations th...
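The device concept (a typed object with settable properties, on which operations can be performed) can be sketched like this. The class, field names, and the set of accepted type strings are illustrative only:

```python
from dataclasses import dataclass, field

# Illustrative type names, loosely following the list in the text.
DEVICE_TYPES = {
    "HeadNode", "Node", "GraphicsProcessingUnit", "EthernetSwitch",
    "InfiniBandSwitch", "MyrinetSwitch", "PowerDistributionUnit",
    "RackSensorKit", "GenericDevice",
}

@dataclass
class Device:
    hostname: str
    device_type: str
    # Settable configuration properties, e.g. rack position, switch port.
    properties: dict = field(default_factory=dict)

    def __post_init__(self):
        if self.device_type not in DEVICE_TYPES:
            raise ValueError(f"unknown device type: {self.device_type}")

    def power_on(self):
        # An operation performed through the management infrastructure.
        return f"powering on {self.hostname}"
```
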
...ts
13.2 HA Set Up Procedure
13.3 Managing HA

A Generated Files
B Bright Computing Public Key
C CMDaemon Configuration File Directives
D Disk Partitioning
  D.1 Structure of Partitioning Definition
  D.2 Example: Default Node Partitioning
  D.3 Example: Preventing Accidental Data Loss
  D.4 Example: Using Custom Assertions
  D.5 Example: Software RAID
  D.6 Example: Logical Volume Manager
  D.7 Example: Diskless
  D.8 Example: Semi-diskless
E Example initialize And finalize Scripts
F Quickstart Installation Guide
  F.1 Installing Head Node
  F.2 First Boot
  F.3 Booting Nodes
  F.4 Running Cluster Management GUI
G Workload Managers Quick Reference
  G.1 Sun Grid Engine
  G.2 Torque
  G.3 PBS Pro
H Metrics, Health Checks And Actions
  H.1 Metrics And Their Parameters
  H.2 Health Checks And Their Parameters
  H.3 Actions And Their Parameters
I Metric Collections
  I.1 Metric Collections Added Using Cmsh
  I.2 Metric Collections Initialization
  I.3 Metric Collections Output During Regular Use
  I.4 Error Handling
...twork device, the correct kernel module needs to be loaded. If this does not happen, booting fails, and the console of the node displays something similar to figure 6.24:

Creating initial device nodes
Setting up hotplug
Creating block device nodes
Loading ehci-hcd.ko module
Loading ohci-hcd.ko module
Loading uhci-hcd.ko module
Loading jbd.ko module
Loading ext3.ko module
Loading sunrpc.ko module
Loading nfs_acl.ko module
Loading fscache.ko module
Loading lockd.ko module
Loading nfs.ko module
Loading scsi_mod.ko module
Loading sd_mod.ko module
Loading libata.ko module
Loading ahci.ko module
Creating root device
Finished original ramdisk for driver initialization
Can't configure the ethernet device used for booting.
You should probably insert the correct kernel module into the ramdisk.
/bin/sh: can't access tty; job control turned off

Figure 6.24: No Network Interface

To solve this issue, the correct kernel module should be added to the software image's kernel module configuration. For example, to add the e1000 module to the default image using cmsh:

Example:

[mc]% softwareimage use default-image
[mc->softwareimage[default-image]]% kernelmodules
[mc->softwareimage[default-image]->kernelmodules]% add e1000
[mc->softwareimage[default-image]->kernelmodules*[e1000*]]% commit
Initial ramdisk for image default-image was regenerated successfully
...ules directory of the node image shows if it is available:

Example:

find /cm/images/default-image/lib/modules -name "*mpt2sas*"

If it is not available, the driver module must then be obtained. If it is a source file, it will need to be compiled. By default, nodes run on standard distribution kernels, so that only standard procedures need to be followed to compile modules.

If the module is available, it can be added to the default image using cmsh in softwareimage mode:

Example:

[bright51]% softwareimage use default-image
[bright51->softwareimage[default-image]]% kernelmodules
[bright51->softwareimage[default-image]->kernelmodules]% add mpt2sas
[bright51->softwareimage[default-image]->kernelmodules*[mpt2sas*]]% commit
[bright51->softwareimage[default-image]->kernelmodules[mpt2sas]]%
Thu May 19 16:55:43 2011 bright51: Initial ramdisk for image default-image is being generated
[bright51->softwareimage[default-image]->kernelmodules[mpt2sas]]%
Thu May 19 16:56:31 2011 bright51: Initial ramdisk for image default-image was regenerated successfully
[bright51->softwareimage[default-image]->kernelmodules[mpt2sas]]%

After committing the change, it can take some time before ramdisk creation is completed (typically about a minute, as the example shows). On rebooting the node, it should now continue past the disk layout stage.

6.7.6 Node Ins...
...up>
<xs:complexType>
  <xs:sequence>
    <xs:element name="diskless" type="diskless" minOccurs="0" maxOccurs="1"/>
    <xs:element name="device" type="device" minOccurs="0" maxOccurs="unbounded"/>
    <xs:element name="raid" type="raid" minOccurs="0" maxOccurs="unbounded"/>
    <xs:element name="volumeGroup" type="volumeGroup" minOccurs="0" maxOccurs="unbounded"/>
  </xs:sequence>
</xs:complexType>

<xs:key name="partitionAndRaidIds">
  <xs:selector xpath=".//raid|.//partition"/>
  <xs:field xpath="@id"/>
</xs:key>

<xs:keyref name="raidMemberIds" refer="partitionAndRaidIds">
  <xs:selector xpath=".//raid/member"/>
  <xs:field xpath="."/>
</xs:keyref>

<xs:keyref name="volumeGroupPhysicalVolumes" refer="partitionAndRaidIds">
  <xs:selector xpath=".//volumeGroup/physicalVolumes/member"/>
  <xs:field xpath="."/>
</xs:keyref>

<xs:unique name="raidAndVolumeMembersUnique">
  <xs:selector xpath=".//member"/>
  <xs:field xpath="."/>
</xs:unique>

<xs:unique name="deviceNodesUnique">
  <xs:selector xpath=".//device/blockdev"/>
  <xs:field xpath="."/>
</xs:unique>

<xs:unique name="mountPointsUnique">
  <xs:selector xpath=".//mountPoint"/>
  <x...
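The raidMemberIds keyref above requires every RAID member value to match a declared partition or RAID id. A rough stand-alone check of that one rule, using Python's standard xml.etree rather than real XSD validation, might look like this (the sample element names follow the schema fragment; everything else is illustrative):

```python
import xml.etree.ElementTree as ET

def undefined_raid_members(disksetup_xml):
    """Return raid <member> values that do not match any partition/raid id,
    mimicking the raidMemberIds keyref in the schema fragment above."""
    root = ET.fromstring(disksetup_xml)
    declared = {e.get("id") for e in root.iter() if e.tag in ("partition", "raid")}
    declared.discard(None)
    members = {m.text for r in root.iter("raid") for m in r.iter("member")}
    return sorted(members - declared)
```

A member value with no matching id (for example a typo in a partition reference) would be reported instead of failing silently.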
...uration.

Additional Network Configuration

The Additional Network Configuration screen (figure 2.10) allows InfiniBand and IPMI/iLO networks to be configured. Clicking Continue here leads to the Networks configuration screen, described next.

Figure 2.10: Additional Network Configuration (options include: Add InfiniBand network; Allow booting over InfiniBand; Add IPMI/iLO network; Automatically configure IPMI/iLO BMC when node boots)

Networks Configuration

The Networks configuration screen (figure 2.11) displays the predefined list of networks, based on the selected network architecture. IPMI and InfiniBand networks are defined based on selections made in the Additional Network Configuration screen earlier (figure 2.10).

The parameters of the network interfaces can be configured in this screen. For a type 1 setup, an external network and an internal network are always defined. For a type 2 setup, only an internal network is defined, and no external network is defined. For a type 3 setup, an internal network and a management network are defined.

Clicking Continue in this screen validates all network settings. Invalid settings for any of the defined networks cause an alert to be displayed, explaining the error. A cor...
Figure 10.18: cmgui (metric configuration pane for All Master Nodes, showing metrics such as AlertLevel and CPUUser with sampling intervals and threshold actions, and the event viewer below)
...ut a number of system configuration files. Some configuration files are written out in their entirety, whereas other configuration files only contain sections that have been inserted by the cluster management daemon. Appendix A lists all system configuration files that are generated.

A file that has been generated by the cluster management daemon contains a header:

This file was automatically generated by cmd. Do not edit manually!

Sections of files that have been generated by the cluster management daemon will read as follows:

This section of this file was automatically generated by cmd. Do not edit manually!
BEGIN AUTOGENERATED SECTION -- DO NOT REMOVE
...
END AUTOGENERATED SECTION -- DO NOT REMOVE

When generated files, or sections of files, are modified manually, the changes are automatically overwritten the next time the content is accessed, an event is generated, and the manually modified configuration file is backed up to:

/var/spool/cmd/saved-config-files

Sometimes, overriding the automatically generated configuration file contents may be necessary. The FrozenFile configuration file directive in cmd.conf allows this.

Example:

FrozenFile = { "/etc/dhcpd.conf", "/etc/postfix/main.cf" }

Configuring The Cluster

After the Bright Cluster Manager software has been installed on the head node, the cluster must be configured. This chapter goes through a number of basic cluster configuratio...
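A script that edits such a file programmatically should keep its changes outside the protected region. A minimal sketch for locating that region by its markers follows; the marker strings here follow the pattern quoted above, but the exact wording in a given generated file should be checked before relying on it:

```python
BEGIN = "BEGIN AUTOGENERATED SECTION -- DO NOT REMOVE"
END = "END AUTOGENERATED SECTION -- DO NOT REMOVE"

def split_around_autogenerated(text):
    """Split file content into (before, autogenerated, after), so that
    manual edits can be confined to the parts CMDaemon will not rewrite."""
    start = text.index(BEGIN)
    stop = text.index(END) + len(END)
    return text[:start], text[start:stop], text[stop:]
```

Edits written into the middle piece would be overwritten the next time the content is accessed, which is exactly what the FrozenFile directive exists to prevent.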
...values. For example, a sendemail action with a parameter root can be appended to the Actions parameter, which already has the killallyes action as a value. This will send an e-mail to the root mail account. A get command can be run to see the values for the threshold actions:

Example:

[...->metricconf[CPUUser]->thresholds*]% append killallyesthreshold actions "sendemail: root"
[...->metricconf[CPUUser]->thresholds*]% get killallyesthreshold actions
enter: killallyes, enter: SendEmail: root

By default, the actions are set to run on entering the threshold zone, with an implied flag of -e (enter). To run on leaving the threshold zone, or to run during the time the value is within the threshold zone, the flags -l (leave) or -d (during) must explicitly be applied to the actions command.

In the example, the Actions parameter now has the value of the built-in action name sendemail, as well as the value of the action script name killallyes. This means that both actions will run when the threshold condition is met.

consolidators: Continuing with the preceding example, if the consolidators mode is entered, then the list command will list the consolidators running on the system. On a newly installed system there are three consolidators by default for each metric set for a device category. Each consolidator has an appropriately assigned time Interval, in sec...
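The enter/leave/during semantics can be modeled with a short sketch. The zone test here is simplified to "value above a single threshold", which ignores the full threshold-zone definition used by the monitoring system:

```python
def threshold_events(samples, threshold):
    """Yield (index, event) pairs: 'enter' when a sample crosses into the
    zone, 'during' while it stays inside, 'leave' when it crosses out."""
    inside = False
    events = []
    for i, value in enumerate(samples):
        now_inside = value > threshold
        if now_inside and not inside:
            events.append((i, "enter"))    # -e actions fire here (default)
        elif now_inside and inside:
            events.append((i, "during"))   # -d actions fire here
        elif inside and not now_inside:
            events.append((i, "leave"))    # -l actions fire here
        inside = now_inside
    return events
```
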
...ve
[mycluster->device*[node100*]]% commit
[mycluster->device[node100]]% exit
[mycluster]% device
[mycluster->device]%

In the above, within device mode, a new object node100 is added, of type slavenode and with IP address 10.141.0.100. The category test-slave is then added, and the test-slave object level within category mode is automatically dropped into when the command is executed. This is usually convenient, but not in this example, where it is assumed the device node object still needs a property under it to be set. To return to device mode again, at the level it was left, the command "device use node100" is executed. The category property of the node100 object is set to the newly created category test-slave, and the object is then committed to store it permanently. Note that until the newly added object has been committed, it remains a local change that is lost when cmsh is exited.

Asterisk tags in the prompt are a useful reminder of a modified state, with each asterisk indicating a tagged object that has an unsaved, modified property.

In most modes the add command takes only one argument, namely the name of the object that is to be created. However, in device mode an extra object type (in this case slavenode) is also required as an argument, and an optional extra IP argument may also be specified. The response to "help add" while in device mode gives details:

[myheadnode->device]% help add
Usage: add <type> <ho...
...ve one of four values: AUTO, FULL, MAIN, and NOSYNC.

• If the install mode is set to FULL, the node-installer re-partitions, creates new file systems, and synchronizes a full image onto the local drive. This process wipes out all pre-boot drive content.

• If the install mode is set to AUTO, the node-installer checks the partition table and file systems of the local drive against the node's stored configuration. If these do not match (because, for example, the node is new), or if they are corrupted, then the node-installer recreates the partitions and file systems by carrying out a FULL install. If, however, the drive partitions and file systems are healthy, the node-installer only does an incremental software image synchronization. Synchronization tends to be quick because the software image and the local drive usually do not differ much.

Synchronization also removes any extra local files that do not exist on the image, for the files and directories considered. Section 6.3.7 gives details on how it is decided what files and directories are considered.

• If the install mode is set to MAIN, the node-installer halts in maintenance mode, allowing manual investigation of specific problems. The local drive is untouched.

• If the install mode is set to NOSYNC, and the partition or filesystem check matches the stored configuration, then the node-installer skips synchronizing the image...
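The four install modes can be summarized as a decision function. This sketch covers only the behaviour described above; what NOSYNC does when the check fails is not covered in this excerpt, so that case is left as an explicit unknown:

```python
def node_installer_action(install_mode, disk_matches_stored_config):
    """Map an install mode plus the disk-vs-stored-config check result to
    the provisioning action described in the text."""
    if install_mode == "FULL":
        return "repartition-and-full-image-sync"      # wipes pre-boot content
    if install_mode == "AUTO":
        return ("incremental-image-sync" if disk_matches_stored_config
                else "repartition-and-full-image-sync")
    if install_mode == "MAIN":
        return "halt-in-maintenance-mode"             # local drive untouched
    if install_mode == "NOSYNC" and disk_matches_stored_config:
        return "skip-image-sync"
    raise ValueError("behaviour not described in this excerpt")
```
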
...verview tab of nodes, in the Health Status section.

10.2.9 Flapping

Flapping, or state flapping, is when a state transition (section 10.2.10) occurs too many times over a number of samples. In the basic example of section 10.1, if the CPUUser metric crossed the threshold zone 7 times within 12 samples (the default values for flap detection), then it would by default be detected as flapping. A flapping alert would then be recorded in the event viewer, and a flapping action could also be launched if configured to do so. Flapping configuration for cmgui is covered for threshold-crossing events in section 10.4.2, when the metric configuration tab's Edit and Add dialogs are explained, and for health check state changes in section 10.4.3, when the health check configuration tab's Edit and Add dialogs are explained.

10.2.10 Transition

A state transition is:

• a health check state change, for example changing from PASS to FAIL, or from FAIL to UNKNOWN

• a metric threshold (section 10.2.3) crossing event. This is only valid for values that Enter or Leave the threshold zone.

1. Visualization

[Figure: cmgui Overview tab for myheadnode, with panels for uptime, memory, swap memory, CPU usage, disk usage, and network usage]
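With the default values quoted above (7 state transitions within 12 samples), flap detection can be sketched as follows; the sliding-window formulation is my illustration of the rule, not the monitoring system's actual implementation:

```python
def is_flapping(transition_flags, window=12, threshold=7):
    """transition_flags: booleans, True where a state transition occurred
    at that sample. Flapping if any run of `window` consecutive samples
    contains at least `threshold` transitions (defaults: 7 within 12)."""
    for start in range(max(1, len(transition_flags) - window + 1)):
        if sum(transition_flags[start:start + window]) >= threshold:
            return True
    return False
```
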
...w.adaptivecomputing.com/resources/docs/, and in particular the Torque administrator manual is available at http://www.adaptivecomputing.com/resources/docs/torque/index.php.

Installing The Maui Scheduler

The Maui scheduler source, version 3.2.6p21, is picked up from the Adaptive Computing website at http://www.adaptivecomputing.com/resources/downloads/maui/index.php. It is installed over the zero-sized placeholder file on the head node at /usr/src/redhat/SOURCES/maui-3.2.6p21.tar.gz.

Maui documentation is available at http://www.adaptivecomputing.com/resources/docs/maui/index.php.

The RPM file is built from the source on the head node for Bright Cluster Manager 5.1 using:

rpmbuild -bb /usr/src/redhat/SPECS/maui.spec

and the installation is done with:

rpm -i /usr/src/redhat/RPMS/x86_64/maui-3.2.6p21-44_cm5.1.x86_64.rpm

Installing The Moab Scheduler

Moab is installed by default in Bright Cluster Manager 5.1. Once the trial license has expired, a license must be obtained from Adaptive Computing.

Running Torque And Schedulers

The Torque resource manager runs the following two daemons:

1. a pbs_server daemon. This handles submission acceptance, and talks to the execution daemons on the compute nodes when sending and receiving jobs. It writes logs to the /cm/shared/apps/torque/current/spool/server_logs directory on its node. Queues for this service are configured with the qmgr command.

2. a pbs_mom execution daemon, running on t...
...with DNS domain storage.cluster, one would use node001.storage.cluster to reach node001 through the storage network.

Internal DNS zones are generated automatically, based on the network definitions and the defined nodes on these networks. For networks marked as external, no DNS zones are generated.

4.2.2 Adding Networks

The Add button in the networks overview tab of figure 4.2 can be used to add a new network. After the new network has been added, the Settings tab (figure 4.4) can be used to further configure the newly added network.

After a network has been added, it can be used in the configuration of network interfaces for devices.

The default assignment of networks (internalnet to Management network, and externalnet to External network) can be changed in the Settings tab of the cluster object (figure 4.6).

[Figure 4.6: cluster object Settings tab in cmgui, showing default category, default software image, external network, and management network]
401. xample sets the filesystem to use 4MB blocks the start OST is chosen by the MDS and stripes data over all available OSTs Bright Computing Inc 13 High Availability In a cluster with a single head node the head node is a single point of fail ure for the entire cluster In certain environments it is unacceptable that a single machine failure can cause a disruption to the daily operations of a cluster Bright Cluster Manager includes high availability HA features which allow clusters to be set up with two head nodes 13 1 HA Concepts In a cluster with an HA setup one of the head nodes is called the primary head node and the other head node is called the secondary head node Under normal operation one of the two head nodes is in active mode whereas the other is in passive mode It is important to distinguish between the concepts of primary secon dary and active passive mode The difference between the two concepts is that while a head node which is primary always remains primary the mode that the node is in may change It is possible for the primary head node to be in passive mode when the secondary is in active mode Simi larly the primary head node may be in active mode while the secondary head node is in passive mode The central concept of HA is that the passive head node continuously monitors the active head node If the passive finds that the active is no longer operational it will initiate a failover sequence A fail
402. y set when express mode is selected is system Clicking on the Continue button brings up the Bright Cluster Manager software license screen described next Bright Computing Inc 2 3 Head Node Installation Bright Cluster Manager Installe Welcome to the Bright Cluster Manager Installer Welcome Q License Inform Version Edition Name Organization Unit Locality State Country Serial Valid from Valid until MAC address Licensed nodes Installation mo Normal r Expre Remote Installation English US Cluster Manager ation aul Advanced My Cluster Bright RD San Jose California US 2055 25 Jul 2010 19 Jul 2038 27 77 22 7 7 72 27 3 ae mmended Cancel Continue Figure 2 1 Installation welcome screen for Bright Cluster Manager Software License The Bright Computing Software License screen Figure 2 2 explains the applicable terms and conditions that apply to use of the Bright Cluster Manager software Accepting the terms and conditions and clicking on the Cont inue but ton leads to the Base Distribution EULA End User License Agreement Figure 2 3 Accepting the terms and conditions of the base distribution EULA and clicking on the Continue button leads to two possibilities 1 If express mode was selected earlier then the installer skips ahead to the Summary screen Figure 2 23 where it shows an overview of the predefined installation parame
403. y the netmask which uses CIDR nota tion CIDR notation is the so called slash representation in which for example a CIDR notation of 192 168 0 1 28 implies an IP address of 192 168 0 1 with a traditional netmask of 255 255 255 240 applied to the 192 168 0 0 network The netmask 255 255 255 240 implies that bits 28 32 of the 32 bit dotted quad number 255 255 255 255 are unmasked thereby implying a 4 bit sized host range of 16 i e 24 addresses The sipcalc utility installed on the head node is a useful tool for cal culating or checking such IP subnet values man sipcalc or sipcalc h for help on this utility Example user brightcluster sipcalc 192 168 0 1 28 ipv4 192 168 0 1 28 0 CIDR Host address 192 168 0 1 Host address decimal 3232235521 Host address hex COA80001 Network address 192 168 0 0 Network mask 255 255 255 240 Network mask bits 28 Network mask hex FFFFFFFO Broadcast address 192 168 0 15 Cisco wildcard 0 0 0 15 Addresses in network 16 Network range 192 168 0 0 192 168 0 15 Usable range 192 168 0 1 192 168 0 14 Every network has an associated DNS domain which can be used to access a device through a particular network For internalnet the de fault DNS domain is set to cm cluster which means that the hostname node001 cm cluster can be used to access device node001 through the primary internal network If a dedicated storage network has been added
404. ycluster gt category misc gt roles provisioning commit mycluster gt category misc gt roles provisioning Assigning a provisioning role can also be done for an individual node instead if using a category is deemed overkill Example mycluster device use node001 mycluster gt device node001 roles mycluster gt device node001 gt roles assign provisioning mycluster gt device node001 gt roles provisioning After carrying out a role change the updateprovisioners command described in section 6 1 4 should be run manually so that the images are propagated to the provisioners and so that CMDaemon is able to stay up to date on which nodes do provisioning Running it manually makes sense in order to avoid rerunning the command several times as typically several role changes are made for several nodes when configuring the provisioning of a cluster The command in any case runs automatically after some time section 6 1 4 6 1 3 Provisioning Nodes Role Setup With cmgui The provisioning configuration outlined in cmsh mode in section 6 1 2 can be done via cmgui too as follows The provisioning category is added by clicking on the Add button in the Overview tabbed pane in the Node Categories resource Figure 6 1 File Monitoring View Help RESOURCES F Node Categories E My Custer EM EORI Modified Name wi Defaultimage vi B gt Gj Switches v0 Networks externalnet internalnet gt
...zation tool is covered later, in section 10.3, using typical data from CPUUser from the basic example of section 10.1.

2. Monitoring Configuration: Selecting the Monitoring Configuration resource in cmgui, from the Resources list on the left-hand side of Bright Cluster Manager, displays the monitoring configuration pane on the right-hand side. Within this pane, sampling methods, data storage, and threshold actions are configured and viewed. Some parts of Monitoring Configuration were used in the basic example of section 10.1, to set up the threshold for CPUUser and to assign the action. It is covered more thoroughly in section 10.4.

3. Event Viewer: The Event Viewer is a log of important events that are seen on the cluster(s). How the events are presented is configurable, with tools that allow filtering based on dates, clusters, nodes, or a text string, and widgets that allow rearranging the sort order or detaching the pane.

4. Overview Of Monitored Data: A dashboard in a car conveys the most important relevant information at a glance, and attracts attention to items that are abnormal and merit further investigation. The same idea lies behind the Overview tab of Bright Cluster Manager. This gives a dashboard view based on the monitored data for a particular device, such as a switch or a cluster (probably the most useful overview, and therefore also the...