86A287EW01 - HPC BAS5 for Xeon V1 Installation and Configuration Guide
Contents
1. Figure H-2: NovaScale R421 rear view connectors

Table H-3: NovaScale R421 Slots and Connectors
Connector/Slot number | Port/Slot            | Use
1                     | PCI Express x8       | InfiniBand interconnect, or Ethernet 1000 Backbone when slot 4 is used for the Ethernet 1000 interconnect
2                     | PCI-X 100 MHz 64-bit |
3                     | Ethernet             | Administration Network or BMC Network
4                     | Gbit Ethernet        | Ethernet 1000 interconnect, or Ethernet Backbone when slot 1 is used for the InfiniBand interconnect

H.3.2 NovaScale R422 Series Compute Node

Figure H-3: NovaScale R422 rear view of the Riser architecture (the diagram shows the BMC, the double Ethernet 10/100/1000 ports, the PCI Express x8 (LP) riser slots and the common power supply, for the models with and without integrated InfiniBand).

The ports attached to the North Bridge or the Memory Controller Hub (MCH) offer a higher performance than those attached to the Enterprise South Bridge (ESB).

Note: Depending on the model, an on-board InfiniBand controller with a dedicated port may be included. The two servers within a NovaScale R422 machine are identical: either they both include the InfiniBand controller or they both do not.
2. (Table of Contents, continued)
Appendix F  Binding Services to a Single Network .......... F-1
Appendix G  Configuring AOC-USASLP-S8iR RAID Adapters for NovaScale R423 and R425 machines .......... G-1
Appendix H  PCI Slot Selection and Server Connectors .......... H-1
  H.1  How to Optimize I/O Performance .......... H-1
  H.2  Creating the List of Adapters .......... H-2
  H.3  Connections for NovaScale R4xx Servers .......... H-3
    H.3.1  NovaScale R421 Series Compute Node .......... H-3
    H.3.2  NovaScale R422 Series Compute Node .......... H-5
    H.3.3  NovaScale R460 Series Service Node .......... H-7
Appendix I  Activating your Red Hat account .......... I-1
Glossary and Acronyms
Index

List of Figures: Figure 1-1 to Figure 1-13, Figure 3-1 to Figure 3-19, ... (the figure captions are not recovered in this extract).
3. Lustre must use dedicated service nodes for I/O functions, NOT combined Login/IO service nodes. NFS can be used both on dedicated I/O service nodes and on combined Login/IO service nodes.

1.2.4 Compute Nodes

The Compute Nodes are optimized to execute parallel code. Interconnect adapters (InfiniBand or Gigabit Ethernet) must be installed on these nodes. Bull NovaScale R421, R421 E1, R422, R422 E1, R425 and R480 E1 servers may all be used as Compute Nodes for BAS5 for Xeon v1.2.

NovaScale R421 and R421 E1 servers
Bull NovaScale R421 and R421 E1 servers are double-socket, dual- or quad-core machines.
Figure 1-7: NovaScale R421 server
Figure 1-8: NovaScale R421 E1 server

NovaScale R422 and R422 E1 servers
Bull NovaScale R422 and R422 E1 servers are double-socket, dual- or quad-core machines.
Figure 1-9: NovaScale R422/R422 E1 machine

NVIDIA Tesla S1070 accelerators
NovaScale R422 E1 and R425 servers can be connected to external NVIDIA Tesla S1070 accelerators, resulting in vast improvements in calculation times. Each accelerator is connected to the server via an external port and 2 PCI cards.
Figure 1-10: NVIDIA Tesla S1070 accelerator

NovaScale R425 servers
Bull NovaScale R425 servers are double-socket, dual- or quad-core machines and include a powerful PSU to support...
4. Figure H-4: NovaScale R422 rear view connectors

Table H-4: NovaScale R422 Slots and Connectors
Connector/Slot number | Port/Slot                  | Use
1                     | PCI Express x8             | InfiniBand Interconnect or Ethernet 1000 Backbone
2                     | LAN port                   | Management Network or BMC Network
3                     | LAN port                   | Gbit Ethernet, or Gbit Ethernet Interconnect, or Ethernet 1000 backbone
4                     | InfiniBand port (optional) | InfiniBand Interconnect

H.3.3 NovaScale R460 Series Service Node

Figure H-5: NovaScale R460 risers and I/O subsystem slotting (the diagram shows the ESI x4 and BMC links on the ESB2, the PCI Express x4 and x8 slots, the PCI-X 133/66 MHz 64-bit slots and the double Ethernet 10/100/1000 ports).

The ports attached to the North Bridge or the Memory Controller Hub (MCH) offer a higher performance than those attached to the Enterprise South Bridge (ESB).

Figure H-6: Rear view of NovaScale R460 Series

Table H-5: NovaScale R460 Slots and Connectors (first part; the table is continued later in this extract)
Connector/Slot number | Port/Slot      | Use
1                     | PCI Express x8 | InfiniBand Double Data Rate Adapter
2                     | PCI Express x4 | Fibre Channel (Disk Rack)
3                     | PCI Express x4 | Fibre Channel (Input/Output)
4                     | PCI Express x8 | Optional backbone: 10 Gigabit Ethernet (Myricom Myri-10G x8) ...
5. 3.2.4  Installing the Bull BAS5 for Xeon software .......... 3-22
3.2.5  Database Configuration .......... 3-24
3.3  STEP 3: Configuring Equipment and Installing Utilities on the Management Node .......... 3-26
3.3.1  Generate the SSH keys .......... 3-26
3.3.2  Configure ... .......... 3-27
3.3.3  Configuring Equipment Manually .......... 3-28
3.3.4  Configuring ... .......... 3-29
3.3.5  Configuring ... .......... 3-29
3.3.6  Configuring Management Tools Using Database Information .......... 3-29
3.3.7  Configuring ... .......... 3-30
3.3.8  Configuring Syslog-ng .......... 3-31
3.3.9  Configuring NTP .......... 3-32
3.3.10  Configuring the kdump kernel dump tool .......... 3-33
3.3.11  Installing and Configuring SLURM (optional) .......... 3-34
3.3.12  ... .......... 3-39
3.3.13  Installing and Configuring PBS Professional Batch Manager (optional) .......... 3-39
3.3.14  Installing Intel Compilers and Math Kernel Library .......... 3-42
3.3.15  Configuring the MPI User environment .......... 3-42
STEP 4: Installing ...
6. Table H-1: PCI-X Adapter Table

(1) If both channels are used. Otherwise the adapter must be categorised as a single channel/port adapter.
(2) Full duplex capability is not taken into account. Otherwise, double the value listed.

It may be possible that these values will be reduced due to the characteristics of the equipment attached to the adapter. For example, a U320 SCSI HBA connected to a U160 SCSI disk subsystem will not be able to provide more than 160 MB/s of bandwidth.

Table H-2: PCI Express Table
Adapter                             | Bandwidth
InfiniBand Voltaire 400 or 410 EX-D | 1500 MB/s
Fibre Channel, dual ports           | 800 MB/s
Fibre Channel, single port          | 400 MB/s (2)
Gigabit Ethernet, dual port         | 250 MB/s
Gigabit Ethernet, single port       | 125 MB/s (2)

H.3 Connections for NovaScale R4xx Servers
The following paragraphs illustrate the I/O subsystem architecture for each family of NovaScale R4xx servers.

H.3.1 NovaScale R421 Series Compute Node

Figure H-1: NovaScale R421 rear view of the Riser architecture (the diagram shows the ESI x4 and BMC links, the PCI Express x4 and x8 slots, the PCI-X 133/100 MHz 64-bit slots and the double Ethernet 10/100/1000 ports).

The ports attached to the North Bridge or the Memory Controller Hub (MCH) offer a higher performance than those attached to the Enterprise South Bridge (ESB).
7. Start with the following operations:
1. Power up the machine.
2. Switch on the monitor.
3. Insert the Red Hat Enterprise Linux Server 5 DVD into the slot-loading drive.
Note: The media must be inserted during the initial phases of the internal tests (whilst the screen is displaying either the logo or the diagnostic messages), otherwise the system may not detect the device.
4. Select all the options required for the language, time, date and keyboard system settings.
5. Skip the media test.

Red Hat Linux Management Node Installation Procedure
A suite of screens helps you to install the RHEL5 software on the Service Node that includes the Management Node services.

Figure 3-1: The Welcome Screen
1. The Welcome screen will appear at the beginning of the installation process.

Figure 3-2: Keyboard installation screen (the screen lists the available keyboard layouts, for example Slovenian, Spanish, Swedish, Swiss French, Swiss German, Tamil, Turkish, U.S. International, Ukrainian, United Kingdom, ...).
2. Select the language to be used for the installation...
8. 3. Asking users to customize their environment by sourcing the /opt/mpi/mpiBull2-<your_version>/share/setenv_mpiBull2 files.

Depending on the setup solution chosen, the Administrator must define two things: a default communication driver for their cluster, and the default libraries to be linked with, according to the software architecture. In all the files mentioned above, the following must be specified:

a. MPIBULL2_COMM_DRIVER: this can be done by using the mpibull2-devices -d command to set the default driver. For InfiniBand systems the name of the driver is ibmr_gen2.
b. The MPIBULL2_PRELIBS variable must be exported to the environment, containing the reference to the SLURM PMI library.

Some examples are provided in the files. For a cluster using the OpenIB InfiniBand communication protocol, the following line must be included in the mpiBull file:

mpibull2-devices -d ibmr_gen2

For a cluster using SLURM, set the following line, and add, if necessary, the path to the PMI library:

export MPIBULL2_PRELIBS="-lpmi"

When using the MPI InfiniBand communication driver, memory locking must be enabled. There will be a warning during the InfiniBand RPM installation if the settings are not correct. The /etc/security/limits.conf file must specify both soft memlock and hard memlock settings, according to the memory capacity of the hardware. These should be set around...
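As an illustration of the memlock settings mentioned above, a minimal sketch of the /etc/security/limits.conf entries follows. The 4 GB value is an assumption used purely as an example and must be sized according to the memory actually fitted in the nodes:

# /etc/security/limits.conf -- illustrative memlock entries (values in KB, example only)
*    soft    memlock    4194304
*    hard    memlock    4194304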
9. 3. Eject the XLustre DVD-ROM.

2.5 Install BAS5 for Xeon v1.2 on the Management Node
1. Go to the /release/XBAS5V1.2 directory:
cd /release/XBAS5V1.2
2. Execute the install command:
install
3. Confirm all the installation options that appear.
4. Optional, for clusters which use SLURM:
Note: Check that the operations described in Section 2.2.1 have been carried out before starting the installation and configuration of SLURM.
Note: Install and configure SLURM on the Management Node as described in STEP 3 in the chapter Installing BAS5 for Xeon v1.2 Software on the HPC Nodes. Munge will be included in the SLURM installation in STEP 3 above, for clusters which use this authentication type for the SLURM components.

2.5.1 Configure the BAS5 for Xeon v1.2 Management Node
The BAS5 for Xeon v1.2 Management Node will be configured automatically, except for the files listed below, where manual intervention is required.

syslog-ng.conf
The BAS5 for Xeon v1.2 syslog-ng.conf file must be manually updated with the cluster details contained in the BAS5 for Xeon v1.1 syslog-ng.conf file saved previously. The BAS5 for Xeon v1.2 syslog-ng.conf file contains bug fixes; this file should be used, and NOT the BAS5 for Xeon v1.1 file.

nagios.cfg
When BAS5 for Xeon v1.1 is updated to BAS5 for Xeon v1.2, Nagios will not start. Use the old version of the nagios.cfg file, which has been renamed as nagios.cfg.rpmsave, and re...
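One hedged way of re-applying the saved v1.1 cluster details is to compare the two files before editing; the path of the saved copy below is purely an assumption and depends on where the v1.1 files were backed up:

diff /root/saved-v1.1/syslog-ng.conf /etc/syslog-ng/syslog-ng.conf    # locate the cluster-specific lines to carry over
vi /etc/syslog-ng/syslog-ng.conf                                      # re-apply them to the v1.2 file, keeping the v1.2 bug fixes
service syslog-ng restart                                             # make the change effective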
10. Figure 3-4: Skip screen for the installation number
3. The BAS5 for Xeon installation procedure requires that the Red Hat Installation Number is NOT entered now. The Installation Number can be entered later, so that you can benefit from the Red Hat support network. Select Skip entering Installation Number; you will also have to click on Skip, as shown in Figure 3-4. Click on Next.
See Appendix I, Activating your Red Hat account, for important information regarding the use of installation numbers.

Figure 3-5: First RHEL5 installation screen (the screen warns that software and data may be overwritten depending on your configuration choices, and offers "Install Red Hat Enterprise Linux Server" for a fresh installation, or "Upgrade an existing installation", which preserves the existing data on your drives).
4. Select the option Install Red Hat Enterprise Linux Server, as shown in Figure 3-5.
Important: The Upgrade an existing installation option is not described in this manual. Contact Bull technical support for more information.
Note: For new clusters which are installing BAS5 for Xeon for the first time...
11. IP address of the HPC Storage Management station -> Community: public
b. Using the Monitors tab, associate the new template to each Service Processor by selecting the Monitor Using Template option.

4.5.3 Complementary Configuration Tasks for EMC Clariion AX4-5 storage devices
The disk array is configured via the Navisphere Express interface, in a web browser, using the following URLs: http://<SPA_ip_address> or http://<SPB_ip_address>
1. Set the disk array name in the Manage Storage System page.
2. Set the security parameters in the System Settings / User Management page. Add a username and a password for the administrator.
3. Set the monitoring parameters in the System Settings / Event Notification page. Set the SNMP Trap Destination to the IP address of the Management Node.

4.5.4 Configuring the EMC Clariion (DGC) Access Information from the Management Node
1. Install the Navisphere CLI rpm on the Administration Node.
Note: This package is named navicli.noarch.rpm and is available on the EMC CLARiiON Core Server Support CD-ROM, which is delivered with an EMC Clariion storage system.
2. Edit the /etc/storageadmin/dgc_admin.conf file and set the correct values for the security parameters, including the Navisphere CLI security options (for naviseccli only); the same user and password must be declared on each disk array by using the command below...
12. It is the customer's responsibility to save data and their software environment before using the procedure described in this chapter. For example, the /etc/passwd and /etc/shadow files, the /root/.ssh directory and the home directories of the users must be saved.

Important: All the data must be saved onto a non-formattable media outside of the cluster. It is recommended to use the tar or cp -a command, which maintains file permissions.

3.0.1 Saving the ClusterDB
1. Login as the root user on the Management Node.
2. Enter:
su - postgres
3. Enter the following commands:
cd /var/lib/pgsql/backups
pg_dump -Fc -C -f /var/lib/pgsql/backups/<name_of_clusterdball.sav> clusterdb
pg_dump -Fc -a -f /var/lib/pgsql/backups/<name_of_clusterdbdata.sav> clusterdb
For example, <name_of_clusterdbdata.sav> might be clusterdbdata-2006-1105.sav.
4. Copy the two .sav files onto a non-formattable media outside of the cluster.

3.0.2 Saving the SSH Keys of the Nodes and of the root User
To avoid RSA identification changes, the SSH keys must be kept:
- To keep the node SSH keys, save the /etc/ssh directory for each node type (Management Node, Compute Node, Login Node, etc.), assuming that the SSH keys are identical for all nodes of the same type.
- To keep the root user SSH keys, save the /root/.ssh directory on the Management Node, assuming that its content is ide...
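As a minimal sketch of such a backup (the /media/backup mount point is an assumed location for the external, non-formattable media):

tar czvf /media/backup/etc-ssh.tar.gz /etc/ssh          # node SSH keys, one archive per node type
tar czvf /media/backup/root-ssh.tar.gz /root/.ssh       # root user SSH keys from the Management Node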
13. # LAN network configuration
id          = ftp-local
wait        = no
user        = root
server      = /usr/sbin/in.ftpd
server_args =
instances   = 4
nice        = 10
only_from   = 0.0.0.0/0           # allows access to all clients
bind        = xxx.xxx.xxx.xxx     # local IP address

# Administration network configuration
id          = ftp-admin
socket_type = stream
wait        = no
user        = root
server      = /usr/sbin/in.ftpd
server_args =
only_from   = xxx.yyy.0.0/16      # only for internal use
bind        = <local IP address>

Note: The configurations above can be adapted and used by other services.

Appendix G: Configuring AOC-USASLP-S8iR RAID Adapters for NovaScale R423 and R425 machines

Note: The operations described in this chapter have to be carried out individually on each NovaScale R423 and R425 machine included in the cluster.

1. Reboot the machine, via conman, from the Management Node. Press Ctrl-A after the Adaptec RAID BIOS line appears, to enter the Adaptec RAID Configuration Utility, as shown below.
Figure G-1: Boot screen with Adaptec RAID BIOS
2. Select the Array Configuration Utility from the Adaptec RAID Configuration Utility Options menu, as shown below.
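After modifying the entries above, xinetd must re-read its configuration; a quick check using standard RHEL5 commands is sketched below (it assumes the entries live in a file under /etc/xinetd.d):

service xinetd restart      # reload the modified service definitions
chkconfig --list xinetd     # confirm xinetd is enabled for the expected runlevels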
14. SlurmEventHandler=/usr/lib/clustmngt/slurm/slurmevent

Note: If the value of the ReturnToService parameter in slurm.conf is left at 0, then when a node that is down is re-booted, the administrator will have to change the state of the node manually, with a command similar to that below, so that the node appears as idle and available for use:
scontrol update NodeName=bass State=idle Reason=test
To avoid this, set the ReturnToService parameter to 1 in the slurm.conf file.

See:
- the slurm.conf man page for more information on all the configuration parameters, including the ReturnToService parameter and those referred to above.
- https://computing.llnl.gov/linux/slurm/documentation.html for an example of the configurator.html tool for SLURM version 1.3.2 and the parameters that it includes.

slurm.conf file example:
ControlMachine=bali0
ControlAddr=bali0
SlurmUser=slurm
SlurmUID=105
SlurmGroup=slurm
SlurmGID=105
SlurmHome=/home/slurm
AuthType=auth/munge
SlurmctldPort=6817
SlurmdPort=6818
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdLogFile=/var/log/slurm/slurmd.log.%h
StateSaveLocation=/var/log/slurm/log_slurmctld
SlurmdSpoolDir=/var/log/slurm/log_slurmd
SlurmctldDebug=3      # default is 3
SlurmdDebug=3         # default is 3
SelectType=select/linear
SchedulerType=sched/builtin      # default is sched/builtin
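To confirm which value is actually in use once the daemons have been restarted, the running configuration can be queried; a small sketch using standard SLURM commands (the node list in the output is cluster-specific):

scontrol show config | grep ReturnToService    # display the ReturnToService value currently loaded
sinfo -N -l                                    # check that re-booted nodes come back as idle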
15. (optional)
- PBS Professional does not work with SLURM.
- The FlexLM License Server has to be installed before PBS Professional is installed; see section 3.3.13.1.
- PBS Professional has to be installed on the Management Node before it is installed on the COMPUTE(X)/LOGIN reference nodes.

See:
- Chapter 4 in the PBS Professional Installation and Upgrade Guide, available on the PBS Professional CD-ROM, for more information on the installation of PBS Professional described below.
- Chapter 3 in the PBS Professional Administrator's Guide, available on the PBS Professional CD-ROM, for more information on the configuration routine for PBS Professional described below.

Starting the installation of PBS Professional
The commands for the installation have to be performed by the cluster Administrator, logged on as root.
1. Copy and extract the package from the PBS Pro CD-ROM to the directory of choice on the COMPUTE(X) Reference Node, using commands similar to those below:
cd /root/PBS
tar xvzf PBSPro_9.2.0-RHEL5_x86_64.tar.gz
2. Go to the installation directory on each node and run:
cd PBSPro_9.2.0
3. Start the installation process:
./INSTALL
Follow the installation program. During the PBS Professional installation routine the Administrator will be asked to identify the following:
Execution directory: the directory into which the executable programs and libraries will...
16. 3.5.11  ... (optional) .......... 3-59
3.5.12  Installing RAID Monitoring Software (optional) .......... 3-60
3.6  STEP 6: Creating and Deploying an Image Using Ksis .......... 3-61
3.6.1  Installing, Configuring and Verifying the Image .......... 3-61
3.6.2  Creating an Image .......... 3-62
3.6.3  Deploying the Image on the Cluster .......... 3-63
3.6.4  Post Deployment Node configuration .......... 3-63
3.7  STEP 7: ... .......... 3-65
3.7.1  Testing ... .......... 3-65
3.7.2  Checking NTP .......... 3-66
3.7.3  Checking Syslog-ng .......... 3-66
3.7.4  Checking ... .......... 3-66
3.7.5  Checking ... .......... 3-69
3.7.6  Checking ... .......... 3-69
3.7.7  Testing PBS Professional Basic setup .......... 3-70
3.7.8  Checking and Starting the SLURM Daemons on COMPUTE(X) and Login/IO Nodes .......... 3-72
3.7.9  ... .......... 3-72
Chapter 4  Configuring Storage Management .......... 4-1
4.1  Enabling Storage Management Services .......... 4-2
17. 1. Switch to postgres:
su - postgres
2. Go to the install directory:
cd /usr/lib/clustmngt/clusterdb/install
3. Remove the existing Cluster DB:
dropdb clusterdb
4. Create a new Cluster DB schema:
create_clusterdb.sh --nouser
5. Truncate the default values:
psql -U clusterdb -c "truncate config_status; truncate config_candidate" clusterdb
6. Run the command:
psql -U clusterdb -c "alter table ic_switch alter column admin_ipaddr drop not null" clusterdb
7. Restore the .sav files saved previously:
pg_restore -Fc --disable-triggers -d clusterdb /var/lib/pgsql/backups/<name_of_clusterdb_saved_file>
8. Go back to root by entering the exit command:
exit

Initializing the Cluster Database using the preload file
Contact Bull Technical Support to obtain the Cluster DB preload file for BAS5 for Xeon v1.2, and then follow the procedure described in section 3.2.5.1 in the BAS5 for Xeon Installation and Configuration Guide for the initialization of the Cluster Database.

Appendix C: Migrating Lustre
For Lustre 1.6.3 and above, the following upgrades are supported:
- Lustre 1.4.x version to the latest Lustre 1.6.x version
- minor version to the next, for example 1.6.2 -> 1.6.3
The complete migration procedure is described in the Upgrading Lustre chapter in the Lustre...
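A quick, non-authoritative way of confirming that the restore worked is to list the restored database and its tables before handing it back to the management tools:

su - postgres
psql -U clusterdb -c "\l" | grep clusterdb     # the clusterdb database should exist
psql -U clusterdb -c "\dt" clusterdb           # its tables should be listed after the pg_restore
exit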
18. (List of Figures, continued)
Figure G-1   Boot screen with Adaptec RAID BIOS .......... G-1
Figure G-2   RAID Configuration Utility Options menu -> Array Configuration Utility .......... G-1
Figure G-3   Array Configuration Utility Main Menu .......... G-2
Figure G-4   Example of Array Properties for a RAID 5 Array .......... G-2
Figure G-5   Example of Array Properties for a RAID 1 array .......... G-3
Figure G-6   Example of drive list for server .......... G-3
Figure G-7   Selection of drives of the same size for new RAID .......... G-4
Figure G-8   Array Properties - Array Type .......... G-4
Figure G-9   Array Properties - Write caching .......... G-5
Figure G-10  Array Properties - Confirmation screen .......... G-5
Figure G-11  RAID Configuration Utility Options .......... G-6
Figure G-12  RAID Configuration Utility Options Menu -> Controller .......... G-6
Figure G-13  SMC AOC-USAS-S8iR Controller .......... G-7
Figure G-14  SAS PHY Settings .......... ...
(The captions for Figures G-15 to G-17 and H-1 to H-6 are not recovered in this extract.)
19. mount /dev/cdrom /media/cdrecorder
3. Copy the BAS5 for Xeon v1.2 XHPC DVD-ROM contents into the /release directory:
cp -a /media/cdrecorder/* /release/XBAS5V1.2
4. Eject the XHPC DVD-ROM.

3.2 Preparing the Installation of the BAS5 for Xeon optional software
According to the cluster type and the software options purchased, the preparation of the installation of the Bull XIB software and/or the XLustre software will now need to be done. The /release/XBAS5V1.2 directory, already created for the XHPC software, will be used, so the only thing to do is to copy the XIB and XLustre software across, as follows.

Preparation for XIB software installation
1. Insert the BAS5 for Xeon v1.2 XIB DVD-ROM into the DVD reader and mount it:
mount /dev/cdrom /media/cdrecorder
2. Copy the BAS5 for Xeon v1.2 XIB DVD-ROM contents into the release directory, as shown below:
unalias cp
cp -a /media/cdrecorder/* /release/XBAS5V1.2
Note: If the unalias cp command has already been executed, the message that appears below can be ignored:
bash: unalias: cp: not found
3. Eject the XIB DVD-ROM.

Preparation for XLustre software installation
1. Insert the BAS5 for Xeon v1.2 XLUSTRE DVD-ROM into the DVD reader and mount it:
mount /dev/cdrom /media/cdrecorder
2. Copy the BAS5 for Xeon v1.2 XLUSTRE DVD-ROM contents into the release directory:
unalias cp
cp -a /media/cdrecorder/* /release/XBAS5V1.2
20. ...remove the 0x part of the node GUID, as shown below. (For interconnects which use Voltaire 4.x firmware, you should always prepend 0x to the NodeGUID.)

switchname(config-sm)# spines add 1 0008f10400411946

The change will take effect after the next reconfiguration of the fabric. Repeat this procedure for all spines: the NodeGUID has to be declared for each spine included in the switch topology, by running the add option separately for each spine.

Note: An ISR 9288/2012 switch has 4 fabric boards, each of them using 3 ASICs, so these types of switches have 4 x 3 = 12 spines.

8.3.3.1 Listing configured spines
Once the NodeGUIDs have been declared, check that the GUID details have been updated by running the command below:

switchname(config-sm)# spines show

This will provide output similar to that below for a cluster with 12 spines (one line per spine, showing the spine index and its NodeGUID):
1  0x0008f10400401e61
...  (11 further lines, one per spine; the remaining GUIDs are garbled in this extract)

Alternatively, the IBS tool (version >= 0.2.8) can be used to produce the same information, as follows:
ibs -a showspines -s <subnet manager IP address or hostname>
21. Figure G-17: RAID Configuration Utility - Exit Utility menu (the menu offers Yes and No options; use the arrow keys to move the cursor, Enter to select an option and Esc to exit).

17. Select Yes from the Exit Utility menu to confirm the settings and press Enter. The "Rebooting the system" message will appear. Once the system has rebooted, the new RAID will have been con...
22. 1. COMPUTE - group A
2. IO - group C
3. LOGIN - group C
4. COMPUTEX - group B
Enter the node function(s) required, using a comma-separated list when more than one function is to be installed, for example: 2,3,4

6. The Bull BAS5 for Xeon optional HPC product(s) to be installed for the cluster, as shown below. By default the Bull XHPC software is always installed.
Select any optional Bull HPC software product(s) to be installed.
N.B. The media corresponding to your choice(s) must have been copied into the /release/XBAS5V1.2 directory.
NONE
XIB
XLUSTRE
XTOOLKIT
Enter the product(s) to be installed, using a comma-separated list when more than one product is to be installed, for example: 1,2

7. The IP address of the NFS server node. This must be the same node as the one on which the script runs.

8. A list of the different nodes that are included in the Cluster database will be displayed, as shown in the example below. The node name(s) of the node(s) to be installed must then be entered, using the following syntax: basename[2,15,18]. The use of square brackets is mandatory.

Node names                 Type   Status
basename1                         not_managed
basename0                  A      up
basename[1076-1148]               not_managed
basename[26-33,309-1075]   C      up
basename[2-23]             I      up

The nodes that are included in the Cluster database are shown above.
23. Enter the ftp menu. Type the following command:
importFile group /tmp
Leave the ftp menu by typing exit. Enter the group menu. Type the following command:
group import

8.6.3 Importing a new group.csv file on a switch running Voltaire 3.X firmware
Assuming the FTP server is set up properly, import the group.csv file located in /tmp:
switchname(config-ftp)# importFile group /tmp
Note: This action takes place using the config ftp menu.
Once this is done, enter the group menu and import the file as follows:
switchname(config-group)# group import

Racks: 3
Elements: 220
Normal events: 3
Warning events: 18
Error events: 0

8.6.4 Importing a new group.csv file on a switch running Voltaire 4.X firmware
Assuming the FTP server is set up properly, import the group.csv file located in /tmp:
switchname(config-group)# group import /tmp
Summary report:
Racks: 3
Elements: 20
Normal events: 3
Warning events: 18
Error events: 0

8.7 Verifying the Voltaire Configuration
The following Command Line Interface commands can be used to verify basic system parameters:
1. To display the version of the current software:
version show
2. To display the ftp server configuration:
ftp show
(Optional)
3. To display the management interface IP address and configuration...
24. Enter the list of nodes to be installed using NFS, syntax example: basename[2,15,18]
Note: BAS5 for Xeon optional HPC products can be installed later manually (see the Appendix).

9. A detailed summary is then displayed, listing the options to be used for the installation, as shown in the example below. The Administrator has to confirm that this list is correct, or exit the installation.

INSTALLATION SUMMARY
- PXE boot files will be copied from /release/RHEL5.1/images/pxeboot
- Path containing the Linux Distro: /release/RHEL5.1
- NFS Server IP address is 10.30.1.99
- Serial Line option is ttyS1,115200
- Partitioning method is: auto
- The following hexa file(s) will be generated in /tftpboot/pxelinux.cfg: 0A1F0106
- The path containing the Bull HPC installer: /release/XBAS5V1.2
- Installation function(s): IO LOGIN
- Optional HPC product(s): XIB XLUSTRE

Please confirm the details above or exit (confirm|exit)

Note: The hexa files will be created in the /tftpboot/pxelinux.cfg directory. These files are called hexa files because their name represents an IP address in hexadecimal format, and they are required for the PXE boot process. Each file corresponds to the IP address of a node. For convenience, the preparenfs script creates links to these files using the node names.

10. A line appears regarding the use of nsctrl commands to reboot the node where the...
25. If you wish to keep the partitioning options as they were previously, click on Reset in the screen above, as shown in Figure 3-9, and confirm the settings, including the mount point, that appear.

Figure 3-11: Confirmation of previous partitioning settings (the screen shows the drive /dev/sda, 70002 MB, model HITACHI HUS151473VLS300, and the LVM volume group VolGroup00 with LogVol01 swap 1984 MB and LogVol00 ext3 67904 MB, and asks "Are you sure you want to reset the partition table to its original state?").

3.1.5 Network Access Configuration

(Screenshot of the RHEL5 network configuration screen: the Edit Interface dialog for the Intel Corporation 80003ES2LAN Gigabit Ethernet Controller (Copper) shows "Activate on Boot", IPv4 support enabled with manual configuration - IP Address 172.19.1.60, Prefix (Netmask) 255.255.0.0 - the hostname set manually to xena0, IPv6 support options, and the Miscellaneous Settings for the Gateway and Primary DNS...)
26. This is done by using the commands below:
scancel --state=pending
scancel --state=running

Uninstall the existing version of SLURM
For clusters which include versions of SLURM earlier than 1.3.2, all files, including the Pam_Slurm module RPMs and config files, must be completely uninstalled before starting the updating operation.

Note: Save the slurm.conf file, as the information that it contains can be re-used when regenerating the new slurm.conf file (see the sketch after this section).

The command below can be used to check the version of the SLURM files that are installed:
rpm -qa "*slurm*"
The existing SLURM files can be deleted using the command below:
rpm -qa "*slurm*" | xargs rpm -e

Uninstall Munge (optional)
If the MUNGE authentication type is used, then the existing versions of the MUNGE files will have to be uninstalled. The command below can be used to check the version of the MUNGE files that are installed:
rpm -qa "*munge*"
The existing MUNGE files can be deleted using the command below:
rpm -qa "*munge*" | xargs rpm -e

2.2.1.4 SLURM Configuration file
It is recommended that the slurm.conf file is rebuilt using the configurator.html tool that comes with SLURM version 1.3.2. The cluster information included in the existing slurm.conf file can be re-used; however, new parameters and extra options have been added, for example for the partition parameter...
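Before removing the RPMs it is worth keeping a copy of the old configuration, as suggested in the Note above; a minimal sketch (the source path is the usual SLURM location and may differ on your cluster):

cp -p /etc/slurm/slurm.conf /root/slurm.conf.v1.1.save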
27. dgc_cli_security  User user  Password password  Scope 0

4.6 Updating the ClusterDB with Storage Systems Information
1. For each storage system, run the command below:
storregister -u -n <disk_array_name>
As a result, the ClusterDB should now be populated with details of disks, disk serial numbers, WWPNs for host ports, and so on.
2. Check that the operation was successful by running the command below:
storstat -d -n <disk_array_name> -H
If the registration has been successful, all the information for the disks (manufacturer, model, serial number and so on) should be displayed.

4.7 Storage Management Services
The purpose of this phase is to build and distribute, on the cluster nodes attached to fibre channel storage systems, a data file which contains a human-readable description for each WWPN. This file is very similar to /etc/hosts. It is used by the lsiocfg command to display a textual description of each fibre channel port instead of a 16-digit WWPN.
1. Build the list of WWPNs on the management station:
lsiocfg -W > /etc/wwn
Note: This file must be rebuilt if a singlet is changed, or if FC cables are switched, or if new LUNs are created.
2. Distribute the file to all the nodes connected to fibre channel systems, for example all the I/O nodes. The file can be included in a KSIS patch of the Compute Nodes. The drawback is that there a...
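One simple way of distributing the /etc/wwn file to the I/O nodes is sketched below; pdcp is the copy companion of the pdsh tool used elsewhere in this guide, and the node list ns[2-15] is purely illustrative:

pdcp -w ns[2-15] /etc/wwn /etc/wwn     # push the WWPN description file to the nodes
pdsh -w ns[2-15] ls -l /etc/wwn        # spot-check that the file is in place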
28. stormap -L
All device aliases listed must return an "up" status.

Quorum disks
If one or more LUNs for a storage system have been declared as quorum disks for Cluster Suite, the configuration (formatting) of these devices as quorum disks is done automatically. Use the command below on each node that is included in a High Availability pair to check this:
mkqdisk -L

Restoring a node
After restoring the system on a node, the aliases also have to be restored, using the deployment command below from the Management Node:
stordepmap -m <model_name> -i <node_name>

5.2 Manual Configuration of I/O Resources
Important: It is not recommended to configure the resources manually, except for those storage systems where automatic configuration is not supported, i.e. Optima 1250 and EMC CLARiiON AX4-5.

5.2.1 Manual Configuration of Storage Systems
Please refer to the documentation provided with the storage system to understand how to use the storage vendor's management tools. Most of the configuration operations can also be performed from the Management Node, using the CLI management commands (ddn_admin, nec_admin, dgc_admin, xyr_admin) provided by the storage administration packages.
See: The BAS5 for Xeon Administrator's Guide for more information.

5.2.2 Manual Configuration of I/O Resources for Nodes
All the storage systems connected to the nodes must have been config...
29. 4.2  Enabling FDA Storage System ... .......... 4-3
4.2.1  Installing and Configuring FDA software on a Linux ... .......... 4-4
4.2.2  Configuring FDA Access Information from the Management ... .......... 4-6
4.2.3  Initializing the FDA Storage System .......... 4-6
4.3  Enabling DataDirect Networks (DDN) S2A Storage Systems .......... 4-8
4.3.1  Enabling Access from Management Node .......... 4-8
4.3.2  Enabling Date and Time Control .......... 4-8
4.3.3  Enabling Event Log Archiving .......... 4-8
4.3.4  Enabling Management Access for Each DDN .......... 4-8
4.3.5  Initializing the DDN Storage Systems .......... 4-9
4.4  Enabling the Administration of an Optima 1250 Storage ... .......... 4-12
4.4.1  Optima 1250 Storage System Management .......... 4-12
4.4.2  Initializing the Optima 1250 Storage System .......... 4-12
4.5  Enabling the Administration of EMC Clariion (DGC) storage ... .......... 4-14
4.5.1  Initial Configuration .......... 4-14
4.5.2  Complementary Configuration Tasks for EMC Clariion CX s...
30. Spine nodeguids currently configured in the subnet manager:
0x0008f10400411946

8.3.3.2 Activating changes
Now that the topology and the spines have been defined, activate the changes as follows:
switchname(config-sm)# sm info sm initiate fabric configuration set
switchname(config-sm)# sm info sm initiate fabric reconfiguration set
switchname(config-sm)# sm info sm initiate routing reconfiguration set

Note: These commands will interrupt all InfiniBand traffic, so be sure to stop all the jobs that are running before using them.

Confirm that the new settings have been implemented by running the sm info show command:
switchname(config-sm)# sm info show

Example output:
subnet manager info is:
  smName: zeus
  port guid: 0008f1040041254a
  topology: 3 stage CLOS
  active topology: 3 stage CLOS
  algorithm: up down
  active algorithm: up down
  sm KEY: 0000000000000000
  sm priority: 3
  sm sweep interval (seconds): 15
  sm verbosity mode: error
  sm topology verbosity: none
  sm mads pipeline: 16
  sm polling retries: 12
  sm activity: 66049
  sm state: master
  sm mode: enable
  sm LMC: 3
  sm hoq: 16
  sm slv: 16
  sm mopvl: v10 14
  subnet prefix: 0xfe80000000000000
  port state change trap: enable
  bad ports mode: disable
  pm mode: enable
  grouping mode: enable

8.4 Performance Manager (PM) setup
The performance manager is a daemon running on a mana...
31. After the system has rebooted, the Administrator must configure the list of post-boot settings which appear. In particular, the following settings MUST be made:
- Disable the firewall.
- Disable SELinux.
- Enable Kdump and select 128 MB of memory for the kernel dump.
13. The time and date must be set.
14. Select Register later for the software update.
15. The option Create the Linux user appears and can be set if required.
16. Ignore the No sound card screen which appears.
17. Ignore the Additional CDs screen.
18. Click on Finish.
19. Click on Reboot.

Network Configurations
Note: The IP addresses used will depend on the address plan for the system. Those used in this section are examples. To configure the network, use the system-config-network command as below; this will launch the graphical tool used for the configuration:
system-config-network

Administration Network Configuration
Note: This section only applies to those devices which have not been configured earlier, or if you wish to change an existing address.

Configure the other network interfaces (e.g. eth1, eth2) if required. Example:
1. In the Devices panel, select device eth1.
2. Click Edit.
3. Select Activate device when computer starts.
4. Select Statically set IP addresses and set the following values, according to your cluster type:
IP ADDRESS: XXX...
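For illustration only, the resulting interface file for eth1 would look something like the sketch below; the addresses follow the example address plan used in this chapter and must be replaced by the values of your own cluster:

# /etc/sysconfig/network-scripts/ifcfg-eth1 -- illustrative static configuration
DEVICE=eth1
ONBOOT=yes
BOOTPROTO=static
IPADDR=172.19.1.60        # example value only
NETMASK=255.255.0.0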
32. Answer the questions with y.
- Save the license in the etc subdirectory:
cp license.dat etc/
- Run the command:
./install.sh
Answer the questions with y.
- Run the command:
/opt/intel/itac/<rel_number>/etc/itacvars.sh

For more details about the installation procedure you can read the Intel Trace Collector User's Guide on the internet site http://www.intel.com/software/products/cluster

7.6 Updating Intel Compilers and BAS5 for Xeon v1.2
BAS5 for Xeon v1.2 has been validated with the Intel C/C++ and Fortran version 10.1.011 compilers for Linux. It will work with later 10.x compiler and MKL releases, provided that the Bull intelruntime-10.1.011 RPM is NOT installed, and the Intel runtime for the compilers and MKL libraries is made available on all the Compute or Extended Compute Nodes. If the intelruntime-10.1.011 RPM has been installed, it can be uninstalled using the following command:
rpm -e intelruntime-10.1.011
Two possible methods exist for updating compiler and MKL versions:
- Install the Intel compilers and MKL libraries on the reference COMPUTE or COMPUTEX Node and redeploy the reference node image using the KSIS tool.
- Install the Intel compilers on the Login Nodes. Then export the /opt/intel directory via NFS and mount it on the COMPUTE or COMPUTEX Nodes. If an Intel license is not available for the node, the compiler w...
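A minimal sketch of the NFS approach is shown below, assuming the compilers are installed on a Login Node called login0 (hypothetical name) and that read-only access is sufficient; the export options are illustrative:

# On the Login Node: export /opt/intel
echo "/opt/intel *(ro,sync)" >> /etc/exports
exportfs -ra
# On each COMPUTE/COMPUTEX Node: mount the exported directory
mount -t nfs login0:/opt/intel /opt/intel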
33. Figure 3-8: Modifying the partitioning layout - 1st screen
Tick the Review and modify partitioning layout box, as shown above.

Figure 3-9: Confirmation to remove existing partitions (the dialog warns: "You have chosen to remove all Linux partitions (and ALL DATA on them) on the following drives: /dev/sda. Are you sure you want to do this?")
b. Click Yes above to confirm the removal of all existing Linux partitions.

Figure 3-10: RHEL5 Partitioning options screen (the screen shows the drive /dev/sda, 70002 MB, model HITACHI HUS151473VLS300, and the LVM volume group VolGroup00 with LogVol01 swap 1984 MB and LogVol00 ext3 67904 MB).
34. (Bull NovaScale Master 5.3 Welcome page - Home > HPC Solutions > Documentation > Configuration: "Welcome to Bull NovaScale Master HPC Edition. Click on HPC Solutions for more information about Bull HPC solutions (NovaScale & Express5800). Click on Documentation for more information about NovaScale Master. Click on Configuration to configure NovaScale Master Monitoring and Administration. You can start the NovaScale Master Console by clicking on the Start Console link.")

Figure 3-18: NovaScale Master Welcome screen

An authentication window appears, asking for a user name and password (the browser prompts, for NovaScale Master Console Authentication Access, for the nagios user name and its password).

Figure 3-19: NovaScale Master Authentication Window

3. Once authenticated, the NovaScale Master console appears (the console window, http://xena0 - NovaScale Master 5.3.0 Console, shows "Welcome to NovaScale Master Server, Login: nagios, Role: Adminis...").
35. Eject the XHPC DVD-ROM.

2.4 Pre-installation Operations for BAS5 for Xeon v1.2 Optional Software
According to the cluster type and the software options purchased, the preparation for the installation of the Bull XIB software and/or the XLustre software must now be done. The /release/XBAS5V1.2 directory, already created for the XHPC software on the Management Node, will be used, so the only thing to do is to copy the XIB and XLustre software across, as follows.

XIB software installation
1. Insert the BAS5 for Xeon v1.2 XIB DVD-ROM into the DVD reader and mount it:
mount /dev/cdrom /media/cdrecorder
2. Copy the BAS5 for Xeon v1.2 XIB DVD-ROM contents into the release directory, as shown below:
unalias cp
cp -a /media/cdrecorder/* /release/XBAS5V1.2
Note: If the unalias cp command has already been executed, the message that appears below can be ignored:
bash: unalias: cp: not found
3. Eject the XIB DVD-ROM.

XLustre software installation
1. Insert the BAS5 for Xeon v1.2 XLUSTRE DVD-ROM into the DVD reader and mount it:
mount /dev/cdrom /media/cdrecorder
2. Copy the BAS5 for Xeon v1.2 XLUSTRE DVD-ROM contents into the release directory:
unalias cp
cp -a /media/cdrecorder/* /release/XBAS5V1.2
Note: If the unalias cp command has already been executed, the message that appears below can be ignored:
bash: unalias: cp: not found
36. ...stormap -L
All device aliases listed must return an "up" status.

Note: For some storage systems (not including FDA and DDN) the stordiskname command may return an error similar to the one below:
Error: This tool does not manage configuration where a given UID appears more than once on the node.
If this happens, try running it with the -m SCSI_ID option.

The stordiskname command builds a /etc/storageadmin/disknaming.conf file, which contains, among other things, details of the symbolic link names, the LUN UIDs and the WWPN access for the LUN(s). Only the stordiskname command can create or modify the node-specific information in this file.

Quorum disks
If one or more LUNs on a storage system have been configured as quorum disks for Cluster Suite, aliases will also be created for these LUNs, but it is important NOT to use these LUNs for any other purposes apart from quorum disks. On each node that is included in a High Availability pair, use the commands below to check this:
mkqdisk -L
stormap -L

Restoring a node
Important: The disknaming.conf file will be erased when redeploying the ksis reference image, when the system is restored for a node. Therefore the stordiskname command should be used with the -r (remote) option from the Management Node, enabling backups and restorations of the /etc/storageadmin/disknaming.conf file to b...
37. (Table H-5, continued)
4 (cont.) |                      | ... or 1 Gbit Ethernet (Intel 82571 Ethernet Controller)
5         | PCI-X 66 MHz 64-bit  |
6         | PCI-X 66 MHz 64-bit  |
7         | Ethernet             | Dedicated Board Management Controller (BMC) connector for the BMC network
8         | Ethernet             | Administration Ethernet Connector
9         | Ethernet             | Gigabit Ethernet Interconnect

Table H-5: NovaScale R460 Slots and Connectors

Note: Either slot number 1 is used for InfiniBand interconnects OR connector number 9 is used for Gigabit Ethernet interconnects. These networks are exclusive.

Appendix I: Activating your Red Hat account
The rhnreg_ks command can be used to activate your Red Hat account. For full details regarding installation numbers and activating your Red Hat account, see:
http://www.redhat.com/support/resources/faqs/installation_numbers/index.html#what_is

WARNING: Do not update the Red Hat RPMs from the Red Hat web site, as Bull cannot guarantee the continued functioning of your BAS5 for Xeon cluster. Contact Bull technical support for more information regarding when the Red Hat and Bull RPMs can be updated.

Glossary and Acronyms
A
ACT   Administration Configuration Tool
API   Application Programmer Interface
ARP   Address Resolution Protocol
B
BAS   Bull Advanced Server
BIOS  Basic Input Output System
C
CMOS  Complementary Metal Oxide Semi-Conductor
38. SFP    Small Form factor Pluggable transceiver - extractable optical or electrical transmitter/receiver module
SEL    System Event Log
SIOH   Server Input/Output Hub
SLURM  Simple Linux Utility for Resource Management - an open source, highly scalable cluster management and job scheduling system
SM     System Management
SMP    Symmetric Multi Processing - the processing of programs by multiple processors that share a common operating system and memory
SMT    Symmetric Multi Threading
SNMP   Simple Network Management Protocol
SOL    Serial Over LAN
SSH    Secure Shell
T
TFTP   Trivial File Transfer Protocol
U
USB    Universal Serial Bus
UTC    Coordinated Universal Time
V
VDM    Voltaire Device Manager
VFM    Voltaire Fabric Manager
VGA    Video Graphic Adapter
VLAN   Virtual Local Area Network
VNC    Virtual Network Computing
W
WWPN   World Wide Port Name
X
XHPC   Xeon High Performance Computing
XIB    Xeon InfiniBand

Index
A: Adaptec RAID Configuration Utility, G-1; adapters, placement, H-1; Apache server, 4-6
B: backbone network, 1-10; BAS5 for Xeon v1.2, B-1; bind attribute, F-1; Brocade switch: configuration, 9-15; enabling, 4-18
C: CISCO Switch configuration, 9-8; CLOS, 8-7; cluster: definition, 1-1; Cluster DB: B-1, Migration, B-1; ClusterDB: Reinstalling, B-3, Saving, B-2; cluste...
39. The Global Licence is included in the standard product.
- The Storeway Optima 1250 Quick Start Guide, specific to the storage system, should be available.
- The IP addresses predefined in the ClusterDB must be the same as those set in StoreWay Master for the Optima 1250. These may be retrieved using the storstat -di command.

4.4.2 Initializing the Optima 1250 Storage System
1. The network settings of the Optima 1250 storage system will need to be configured for the first start-up of the StoreWay Master module, if this has not already been done by manufacturing:
- Configure your LAPTOP with the local address 10.1.1.10.
- Connect it to the Ethernet Port of the Optima 1250 storage system using an Ethernet cross cable.
- Insert the Software and manual disk, delivered with the Optima 1250 storage system, into your CD drive. The autorun program will automatically start the navigation menu.
- Select Embedded StoreWay Master set up.
- Review the information on the screen and click the Next button. The program searches for the embedded master module using the addresses 10.1.1.5 and 10.1.1.6. Use the embedded module MAC address for each controller whose network settings are being configured.
- The IP addresses of the Ethernet management LAN ports must be set according to the values predefined in the ClusterDB.
- Enter and confirm the new password and then click the confi...
40. This document also contains additional information about High Availability for I/O nodes and the Cluster DB.

6.3.1 Enabling Lustre Management Services on the Management Node
1. Restore the Lustre system configuration information, if performing a software migration:
- the /etc/lustre directory
- the /var/lib/ldap/lustre directory, if Lustre High Availability is included
2. Verify that the I/O and metadata nodes information is correctly initialized in the ClusterDB, by running the command below:
lustre_io_node_dba list
This will give output similar to that below, displaying the information specific to the I/O and metadata nodes. There must be one line per I/O or metadata node connected to the cluster:

IO nodes caracteristics
id  name  type  netid  clus_id  HA_node  net_stat  stor_stat  lustre_stat
4   ns6   I     6      1        ns7      100       0          100  OK
5   ns7   IM    7      1        ns6      100       0          100  OK

The most important things to check are that the I/O nodes are listed with the right type (I for OSS and/or M for MDS) and that the High Availability node is the right one. It is not a problem if net_stat, stor_stat and lustre_stat are not set; however, these should be set when the file systems are started for the first time. If there are errors, the ClusterDB information can be updated using the lustre_io_node_dba set command.
Note: Enter lustre_io_node_dba --help for more information about the different parameters available for lus...
41. ...State Online, Location 0.5, Vendor FUJITSU, Size 140272, LogicalDisk RDSSAS, Role -
Device 6: SN DQ00PE5004L9, WWN 500000E01203CE80, State Online, Location 0.6, Vendor FUJITSU, Size 140272, LogicalDisk RDSSAS, Role -
Device 7: SN WD-WCANYS792380, WWN Unknown, State Ready, Location 0.7, Vendor WDC, Size 239372, LogicalDisk -, Role -

3.6 STEP 6: Creating and Deploying an Image Using Ksis
This step describes how to perform the following tasks:
1. Installation and configuration of the image server.
2. Creation of an image of the COMPUTE(X) Node and Login or I/O or Login/IO Reference Node installed previously.
3. Deployment of these images on the cluster nodes.
These operations have to be performed from the Management Node.

Please refer to the BAS5 for Xeon High Availability Guide, if High Availability is to be included for any part of your cluster, to check that all the High Availability configurations necessary are in place on the Reference Node image.

Note: To create and deploy a node image using Ksis, all system files must be on local disks and not on the disk subsystem. To create an I/O node image, for example, all disk subsystems must be unmounted and disconnected.

Important: It is only possible to deploy an image to nodes that are equivalent and have the same hardware architecture: Platform, Disks, Network interface.
See: The BAS5 for Xeon Administrator's G...
42. ...problems.

Checking and Starting the SLURM Daemons on COMPUTE(X) and Login/IO Nodes
Check to see if the Slurmctld daemon has started on the Management Node, and the Slurmd daemon has started on the combined LOGIN/IO (or dedicated LOGIN) and on a COMPUTE(X) Node, by using the command below, which lists all the nodes:
scontrol show node
If NOT, then start the daemons using the commands below:
- On the Management Node:
service slurm start
- On the Compute Nodes:
service slurm start
Verify that the daemons have started by running the scontrol show node command again.

Testing kdump
Important: It is essential to use non-stripped binary code within the kernel. Non-stripped binary code is included in the debuginfo RPM, kernel-debuginfo-<kernel_release>.rpm, available from http://people.redhat.com/duffy/debuginfo/index-js.html. This package will install the kernel binary in the folder /usr/lib/debug/lib/modules/<kernel_version>/.

In order to test that kdump is working correctly, a dump can be forced using the commands below:
echo 1 > /proc/sys/kernel/sysrq
echo c > /proc/sysrq-trigger
The end result can then be analysed using the crash utility. An example command is shown below. The vmcore dump file may also be found in the /var/crash folder:
crash /usr/lib/debug/lib/modules/<kernel_version>/vmlinux vmcore
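Before forcing the crash it is worth confirming that the dump mechanism is armed; a small sketch using standard RHEL5 commands:

service kdump status                    # the kdump service should be reported as operational
grep crashkernel /proc/cmdline          # the crashkernel= reservation should appear in the boot parameters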
43. scp root@<Management Node IP address>:/etc/hosts /etc/hosts

3.5.2 Configuring Ganglia
1. Copy the file /usr/share/doc/ganglia-gmond-3.0.5/templates/gmond.conf into /etc.
2. Edit the /etc/gmond.conf file:
- In line 18, replace "xxxxx" with the basename of the cluster:
name = "xxxxx"    /* replace with your cluster name */
- In line 24, replace x.x.x.x with the alias IP address of the Management Node:
host = x.x.x.x    /* replace with your administration node ip address */
3. Start the gmond service:
service gmond start
chkconfig --level 235 gmond on

3.5.3 Configuring the kdump kernel dump tool
1. Reserve memory in the kernel that is running for the second kernel that will make the dump, by adding crashkernel=128M@16M to the grub kernel line, so that 128 MB of memory at 16 MB is reserved, in the /boot/grub/grub.conf file, as shown in the example below:
kernel /vmlinuz-2.6.18-53.el5 ro root=LABEL=/ nodmraid console=ttyS1,115200 rhgb quiet crashkernel=128M@16M
It will be necessary to reboot after this modification.
2. The following options must be set in the /etc/kdump.conf configuration file:
a. The path and the device partition where the dump will be copied to should be identified by its LABEL, /dev/sdx or UUID label, either in the /home/ or / directories. Examples:
path /var/crash
ext3 /dev/sdb1
ext3 LABEL=/boot
ext3 UUID=03138356-5e61-
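Putting the options above together, a hedged sketch of what the resulting /etc/kdump.conf might contain is shown below; the dump target is an example only and must match a real partition or label on the node:

# /etc/kdump.conf -- illustrative example
path /var/crash                         # directory, on the dump target, where the vmcore will be written
ext3 LABEL=/boot                        # dump target chosen by label (a /dev/sdX device or a UUID can be used instead)
core_collector makedumpfile -c -d 1     # compressed dump at dump level 1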
44. switches the topology is most likely CLOS 3 e If the network uses both kinds of switches then the topology is certainly CLOS 5 E The System Administrator should know which topology applies to his cluster If not contact Bull for more information Pre requisite All the following switch configuration commands take place inside the config sm menu To enter this menu proceed as follows ssh enable switchname enable switchname s password voltaire Welcome to Voltaire Switch switchname connecting switchname config switchname config sm switchname config sm 8 3 1 Setting the Topology CLOS stage 1 Use the sm info show command and look at the topology and active topology fields to check which topology setting is in place for the cluster This should match the setting required for the cluster lt switchname gt config sm sm info show subnet manager info is smName port guid 0008f1040041254a topology 5 stage CLOS active topology 5 stage CLOS lt algorithm up down active algorithm up down Installing and Configuring InfiniBand Interconnects 8 7 sm KEY 0000000000000000 sm priority 3 sm sweep interval seconds 15 sm verbosity mode error sm topology verbosity none sm mads pipeline 16 sm polling retries 12 sm activity 98663 sm state master sm mode enable sm LMC 0 sm hoq 16 sm slv 16 sm mopvl v10 14 subnet pre
…[0,1,2]

Example
An example of a description file is shown below for a node with an InfiniBand interface.
cat /etc/sysconfig/network-scripts/ifcfg-ib0
DEVICE=ib0
ONBOOT=yes
BOOTPROTO=static
NETWORK=172.18.0.0
IPADDR=172.18.0.4

Note: The value of the last byte (octet) of the IPADDR address is always 1 more than the value of the machine number. For example, in the interface above the machine number is 3 (ns3) and so the last byte in the IPADDR setting is 4.

Checking the interfaces
It is recommended that the configuration of the Ethernet and InfiniBand interfaces is verified, to ensure that all the settings are OK. This is done by running the command below for InfiniBand interfaces:
pdsh -w node[n-m] cat /etc/sysconfig/network-scripts/ifcfg-ib[0,1,2]
or the command below for Ethernet interfaces:
pdsh -w node[n-m] cat /etc/sysconfig/network-scripts/ifcfg-eth[1,2,3]
Alternatively, to see the interface settings separately, in groups, for a set of nodes, use the commands below.

Note: The examples below show the commands to be used for InfiniBand interfaces. For Ethernet interfaces, replace the adapter interface identifier accordingly, for example replace ifcfg-ib0 with ifcfg-eth1.
pdsh -w node[n-m] cat /etc/sysconfig/network-scripts/ifcfg-ib0 | grep IPADDR
pdsh -w node[n-m] cat /etc/sysconfig/network-scripts/ifcfg-ib0 | gre
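To illustrate the numbering rule in the Note above, a hypothetical /etc/sysconfig/network-scripts/ifcfg-ib0 file for node ns10 (machine number 10) would use 11 as the last octet; the other values are simply carried over from the example above.

DEVICE=ib0
ONBOOT=yes
BOOTPROTO=static
NETWORK=172.18.0.0
IPADDR=172.18.0.11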
1.3.2 Program Execution ... 1-15
1.4 Bull BAS5 for Xeon software distribution ... 1-16
1.4.1 Installing Software and Configuring Nodes ... 1-16
Chapter 2. Updating BAS5 for Xeon v1.1 clusters to BAS5 for Xeon v1.2 ... 2-1
2.1 BAS5 for Xeon (…) Files ... 2-1
2.2 High (…) ... 2-1
2.2.1 Optional - for SLURM clusters only ... 2-2
2.2.2 BAS5 for Xeon v1.1 Configuration files ... 2-3
2.3 Pre-installation Operations for BAS5 for Xeon v1.2 XHPC ... 2-3
2.4 Pre-installation Operations for BAS5 for Xeon v1.2 Optional Software ... 2-4
2.5 Install BAS5 for Xeon v1.2 on the Management (…) ... 2-5
2.5.1 Configure BAS5 for Xeon v1.2 Management (…) ... 2-5
2.6 Install BAS5 for Xeon v1.2 on the Reference (…) ... 2-6
2.7 Deploy the BAS5 for Xeon v1.2 Reference Node (…) ... 2-7
2.7.1 Deployment Pre-Requisites …
47. 2008 NO OK D A ES8GB IN ul 30 09 09 08 2008 Shutting down dhcpd Starting dhcpd Wed Wed INSERT group ALL INSERT group IO INSERT group COME INSERT group ME INSERT group NOD INSERT group ADM Wed J Wed Jul 30 09 09 Wed Jul 30 09 09 Wed Jul 30 09 09 Stopping ConMan Starting ConMan Wed Jul 30 09 09 Wed Jul 30 09 09 Wed Jul 30 09 09 Wed Jul 30 09 09 08 2008 08 2008 08 2008 conmand conmand 08 2008 08 2008 08 2008 08 2008 OK Jul 30 09 09 07 2008 NO Jul 30 09 09 07 2008 NO O O O O Begin synchro for syshosts synchro for syshosts Begin synchro for sysdhcpd synchro for sysdhcpd Begin synchro for group xena 1 18 30 33 140 141 xena 1 2 11 17 18 140 141 xena 3 8 14 xena 10 12 xena 0 18 30 33 140 141 0 synchro for group Begin synchro for pdsh synchro for pdsh Begin synchro for conman INITIALIZATION of the services Running configuration check done Resetting host status in DB Stopping NovaScale Master nagios Starting NovaScale Master nagios OK syslog ng Reloading syslog ng syslog ng pid 2998 Reloading syslog ng Wed Jul 30 09 09 10 2008 NO Wed Jul 30 09 09 10 2008 NO OK OK pid 2998 Wed Jul 30 09 09 10 2008 NO 3 Switch to po
4ab3-b58e-27507ac41937
b. The tool to be used to capture the dump must be configured. Uncomment the core_collector line and add a dump level of 1, as shown below:
core_collector makedumpfile -c -d 1
-c indicates the use of compression and 1 indicates the dump level.

Important: It is essential to use non-stripped binary code within the kernel. Non-stripped binary code is included in the debuginfo RPM, kernel-debuginfo-<kernel_release>.rpm, available from http://people.redhat.com/duffy/debuginfo/index-js.html. This package will install the kernel binary in the folder /usr/lib/debug/lib/modules/<kernel_version>/.

Note: The size of the dump device must be larger than the memory size if no compression is used.

3. Use the command below to launch kdump automatically when the system restarts:
chkconfig kdump on

3.5.4 Installing and Configuring SLURM (optional)

Important: SLURM does not work with the PBS Professional Batch manager and must only be installed on clusters which do not use PBS Professional.

The SLURM files are installed under the /usr and /etc directories.

Note: These steps must be carried out for each COMPUTE(X) and LOGIN Reference Node.

3.5.4.1 Installing SLURM on the Reference Nodes
1. Mount NFS from the release directory on the Management Node to the release director
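A sketch of this first step, together with the RPM installation that follows it, as run on a COMPUTE(X) or LOGIN Reference Node. The NFS export of /release by the Management Node and the package list (taken from the Management Node instructions earlier in this chapter) are assumptions to check against your own release layout; <mgmt_node> is a placeholder.

mount -t nfs <mgmt_node>:/release /release
yum install slurm pam_slurm slurm-munge slurm-auth-none slurm-devel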
…check if the service is stopped beforehand. This returns 0 if successful.
reinstall: Reinstall the MGS service, using the erase and install targets. Free the loopback reservation, using the losetup -d command. This returns 0 if successful.
clear: Clean the loopback map, using the losetup -a and losetup -d commands. Returns 0 if successful.

4. Installation of the MGS service on the Management station. You must apply this section before running Lustre, and before running the lustre_util install command.
Ensure that the lustre.cfg file is completed correctly and dispatched; use the lustre_util set_cfg tool.
Run the command below to install the MGS service:
service mgs install
mgs installed OK
5. Start the MGS service on the Management Node:
service mgs start
Starting mgs on xena0: mgs 0 is not running
mgs started OK
When there is no High Availability on the Management Node, the service must be started at boot time. Run the command below in order to ensure that the MGS service restarts:
chkconfig --add mgs

6.3.7 Lustre Pre-Configuration Checks
Save the lustre.cfg file and quit the editor.
1. Once the lustre.cfg file has been edited, copy it to the Secondary Management Node for clusters which feature High Availability for the Management Node.
2. Use the service mgs status command to check that the mgs service is running on the Management Node:
service mgs stat
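Taken together, the MGS installation and start-up steps above amount to the following sequence on the Management Node (no High Availability case); this is only a summary sketch of the commands already given.

lustre_util set_cfg        # dispatch the completed lustre.cfg file
service mgs install        # expect: mgs installed OK
service mgs start          # expect: mgs started OK
chkconfig --add mgs        # restart the MGS service at boot time
service mgs status         # check that the mgs service is running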
50. Guide Chapter 1 Cluster Configuration This chapter explains the basics of High Performance Computing in a LINUX environment It also provides general information about the hardware and software configuration of a Bull BASS for Xeon HPC system The following topics are described e 1 1 Introduction e 1 2 Hardware Configuration e 1 3 Software Environment e 1 4 Bull BASS for Xeon software distribution 1 1 Introduction A cluster is an aggregation of identical or very similar individual computer systems Each system in the cluster is a node Cluster systems are tightly coupled using dedicated network connections such as high performance low latency interconnects and sharing common resources such as storage via dedicated file systems Cluster systems generally constitute a private network this means that each node is linked to the other nodes in the cluster This structure allows nodes to be managed collectively and jobs to be launched on several nodes of the cluster at the same time 1 2 Hardware Configuration Note Bull BAS5 for Xeon High Performance Computing cluster nodes use different NovaScale Xeon servers Cluster architecture and node distribution differ from one configuration to another Each customer must define the node distribution that best fits his needs in terms of computing power application development and I O activity The System Administrators must have fully investigated and confirmed the planned n
…Nodes.
• NFS (Network File System) can be used to share file systems in the /home directory across all the nodes of the cluster.
• Lustre Parallel File System.
This chapter describes how to configure these three file structures.

6.1 Setting NIS to share user accounts
For those clusters which include dedicated I/O/LOGIN nodes, there is no need to use NIS on the Management Node.

6.1.1 Configure NIS on the Login Node (NIS server)
1. Edit the /etc/sysconfig/network file and add a line for the NISDOMAIN definition:
NISDOMAIN=<DOMAIN>
Any domain name may be used for DOMAIN; however, this name should be the same on the Login Node, which is acting as the NIS server, and on all the Compute Nodes (NIS clients).
2. Start the ypserv service:
service ypserv start
3. Configure ypserv so that it starts automatically whenever the server is started:
chkconfig ypserv on
4. Initialize the NIS database:
/usr/lib64/yp/ypinit -m

Note: When a new user account is created, the YP database should be updated by using the commands:
cd /var/yp
make

6.1.2 Configure NIS on the Compute and/or the I/O Nodes (NIS client)
1. Edit the /etc/sysconfig/network file and add a line for the NISDOMAIN definition:
NISDOMAIN=<DOMAIN>
Any domain name may be used for DOMAIN; however, this name should be the same on the Login Node whi
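A minimal sketch of the client-side configuration on a Compute or I/O Node is given below. The ypbind service is an assumption (it is the usual NIS client daemon on RHEL5); <DOMAIN> must match the value set on the Login Node acting as the NIS server.

# /etc/sysconfig/network on the NIS client
NISDOMAIN=<DOMAIN>
# start the NIS client daemon and enable it at boot (service name assumed)
service ypbind start
chkconfig ypbind on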
2.7.2 (…) ... 2-8
2.7.3 Deploy the Image on the Cluster ... 2-8
2.7.4 Post Deployment Configuration ... 2-8
2.7.5 Configuring Interconnect Interfaces ... 2-9
2.7.6 Post Deployment Operations ... 2-9
2.7.7 Restoring I/O Node aliases ... 2-9
2.7.8 Reconfiguring Cluster Suite on High Availability I/O (…) ... 2-11
2.8 Post Deployment Checks ... 2-11
2.8.1 Optional - for SLURM (…) ... 2-11
Chapter 3. Installing BAS5 for Xeon v1.2 Software on the HPC Nodes ... 3-1
3.1 Installation Process Overview ... 3-2
3.0 Pre-installation Backup Operations when Re-installing BAS5 for Xeon v1.2 ... 3-3
3.0.1 (…) ... 3-3
3.0.2 Saving SSH Keys of the Nodes and of root ... 3-4
3.0.3 Saving the Storage Configuration ... 3-4
3.0.4 Saving t…
Check on both the admin and node hosts that the syslog-ng service has started on both hosts:
service syslog-ng status
The output should be:
syslog-ng (pid 3451) is running...
2. On the node host, run the command below to test the configuration:
logger "Test syslog-ng"
3. On the node host, check in the /var/log/messages file that the message is present.
4. On the admin host, check in the /var/log/HOSTS/<node_hostname>/messages file that the message is present.

3.7.4 Checking Nagios
Both the nagios and httpd services have to be running on the Management Node:
service nagios status
nsm_nagios (pid 31356 31183 19413) is running...
service httpd status
httpd (pid 18258 18257 18256 18255 18254 18253 18252 18251 5785) is running...
1. Start a web browser (Firefox, Mozilla, etc.) and enter the following URL:
http://<admin_node_name>/NSMaster
[Browser screenshot]
Figure 3-17. Launching NovaScale Master
2. Then left-click on the Start Console button.
[Browser screenshot: Bull NovaScale Master 5.3.0 - Mozilla Firefox]
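The syslog-ng test in points 2 to 4 can be run end to end as shown below; the grep commands are simply one way of checking that the message arrived, and the node hostname is a placeholder.

# on the node host
logger "Test syslog-ng"
grep "Test syslog-ng" /var/log/messages
# on the admin host
grep "Test syslog-ng" /var/log/HOSTS/<node_hostname>/messages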
54. applications Modules Preface Chapter 8 Chapter 9 Appendix A Appendix B Appendix C Appendix D Appendix E Appendix F Appendix G Appendix H Appendix I Bibliography Installing and Configuring InfiniBand Interconnects Describes the tasks for the installation and configuration of different Voltaire Devices Configuring Switches and Card Describes how to configure CISCO and Foundry Ethernet switches Voltaire InfiniBand and Brocade switches Default Logins for different cluster elements Details the default logins for different cluster elements Migrating and Reinstalling the Cluster Database Describes how to migrate the Cluster Database Migrating Lustre Describes how to migrate to Lustre v1 6 x Manually Installing BAS5 for Xeon Additional Software Configuring Interconnect Interfaces Describes the config ipoib command for Ethernet interconnects Interface description file Binding Services to a Single Network Describes the use of the bind attribute in the etc xinetd conf file to restrict Configuring AOC USASLP S8iR RAID Adapters for NovaScale R423 and R425 machines PCI Slot Selection and Server Connectors Activating your Red Hat Account Glossary and Acronyms Lists the Acronyms used in the manual Refer to the manuals included on the documentation CD delivered with you system OR download the latest manuals for your Bull Advanced Server BAS release and for your cluster hardware
…being misconfigured. The Ethernet interfaces are named eth0, eth1, eth2, etc., according to the PCI bus order. So, when a new Ethernet board is added, the Ethernet interface names may be changed if the PCI bus detects the new board before the existing on-board Ethernet interfaces. PCI bus detection is related to the position of the PCI slots.
To avoid misconfiguration problems of this type, before installing a new Ethernet board you should:
1. Obtain the MAC addresses of the on-board Ethernet interfaces by using the ifconfig eth0 and ifconfig eth1 commands.
2. After the new Ethernet board has been installed, obtain the MAC addresses of the new Ethernet interfaces: obtain all the MAC addresses using the ifconfig command.
3. Edit each /etc/sysconfig/network-scripts/ifcfg-ethX file (ethX = eth0, eth1, etc.) and add an HWADDR=<MAC ADDRESS> attribute for each interface, in each file, according to the Ethernet interface name and the MAC address obtained in Step 2 above.

Appendix A. Default Logins for different cluster elements

Element                          Login           Password        Comments
Baseboard Management Controller  administrator   administrator
InfiniBand switches              enable          voltaire        Equivalent to root; used for switch configuration
                                 admin           123456          Read only
                                 admin           admin
Ethernet switches                admin           admin           Same login and password f
56. clusters which do not include Lustre High Availability e If there is a cluster database but no management tools are provided for the storage devices being used This file allows you to populate the lustre ost and lustre mdt tables using the usr lib lustre load_storage sh script 9 Skip this phase for a migration to BASS for Xeon v1 2 or if BASS for Xeon v1 2 is being reinstalled as the etc lustre directory will have been saved 6 3 4 Configuring the High Availability services Lustre High Availability clusters only Lustre HA Carry out the actions indicated in the Checking the Cluster Environment and the Using Cluster Suite sections in the Configuring High Availability for Lustre chapter in the BASS for Xeon High Availability Guide 63 5 Lustre Pre Configuration Operations 1 Change the Lustre user password The lustre_mgmt rpm creates the lustre user on the Management node with lustre as the password It is strongly advised to change this password by running the following from the root command line on both Primary and Secondary Management nodes for High Availability systems passwd lustre The lustre user is allowed to carry out most common operations on Lustre filesystems by using sudo In the next part of this document the commands can also be run as lustre user using the sudo lt command gt For example sudo lustre_util status 2 Set Lustre Network layers Lustre runs on all
57. copper Ethernet or as a fiber 100 1000 Mbps port when using an SFP transceiver in the corresponding SFP port The Fastlron LS 648 includes two 10 Gigabit Ethernet slots that are configurable with single port 10 Gigabit Ethernet pluggable modules e Fastlron LS switches include an integral non removable AC power supply An optional one rack unit high AC power supply unit can be used to provide back up power for up to four Fastlron LS switches e The BIGIRON RX 4 RX 8 and RX 16 racks include 4 8 or 16 I O modules that in turn can accommodate either 1 Gigabit Ethernet or 10 Gigabit Ethernet ports See The www cisco com and www foundry com for more details regarding these switches Chapter 8 in the BAS5 for Xeon Installation and Configuration Guide for more information on configuring Ethernet switches Storage The storage systems supported by BASS for Xeon include the following Storeway 1500 and 2500 FDA Storage systems Based on the 4Gb s FDA Fibre Disk Array technology the Storeway 1500 and 2500 networked FDA Storage systems support transactional data access associated with fibre and SATA disk media hierarchies RAID6 double parity technology enables continued operation even in the case of double disk drive failures thus providing 100 times better data protection than for RAIDS Brocade Fibre Channel switches are used to connect FDA storage units and help to ensure storage monitoring within NovaScale Master HPC Ed
58. for Xeon release for details of any restrictions which may apply Installing BAS5 for Xeon v1 2 Software on the HPC Nodes 3 1 Installation Process Overview The process to install Bull BASS for Xeon v1 2 on the HPC cluster s nodes is divided into different steps to be carried out in the order shown below Backups Operations when Re installing 55 for Xeon v1 2 Skip this step if you are installing for the first time This step only applies in the case of a re installation when the cluster has already been configured or partially configured and there is the desire to save and reuse the existing configuration files for the re installation of BAS5 for Xeon v1 2 Installing the RHEL5 1 software on the Management node 1 RAID configuration optional 2 Installation of the Red Hat Enterprise Linux 5 Server software Page 3 5 3 First boot settings ae A Configuring the Network 5 Installing an external Storage System Installing Bull BASS for Xeon software on the Management Node 1 Installing Bull XHPC XIB and XLustre software Page 3 20 2 Database Configuration Configuring equipment and installing utilities on the Management Node 1 Configuring Equipment Manually small clusters only 2 Configuring Ethernet switches 3 Installing and configuring Ganglia Syslog ng NTP Postfix Kdump P 326 SLURM and PBS Pro on the Management Node mes 4 Installing compilers only on Management Nodes which include Login functi
59. from http support bull com The Bull BASS for Xeon Documentation CD ROM 86 A2 91EW includes the following manuals e HPC BASS for Xeon Installation and Configuration Guide 86 A2 87EW e Bull HPC BASS for Xeon Administrator s Guide 86 A2 88EW e Bull HPC BASS for Xeon User s Guide 86 A2 89EW e Bull HPC BASS for Xeon Maintenance Guide 86 A2 90EW e Bull HPC BASS for Xeon Application Tuning Guide 86 A2 16FA BASS for Xeon Installation and Configuration Guide e HPC BASS for Xeon High Availability Guide 86 A2 21FA The following document is delivered separately e Software Release Bulletin SRB 86 A2 71 EJ b The Software Release Bulletin contains latest information for your BAS delivery This should be read first Contact your support representative for more information In addition refer to the following e Bull Voltaire Switches Documentation CD 86 A2 79ET e NovaScale Master documentation For clusters which use the PBS Professional Batch Manager e PBS Professional 9 2 Administrator s Guide on the PBS Professional CD ROM e PBS Professional 9 2 User s Guide on the PBS Professional CD ROM Highlighting Commands entered by the user are in a frame in Courier font as shown below mkdir var lib newdir e System messages displayed on the screen are in Courier New font between 2 dotted lines as shown below e Values to be
…independent service which has to be run separately. You can only have one MGS running per node.
1. When you are configuring your /etc/lustre/lustre.cfg file, there are some fields that have to be filled in to link the MGS with the Lustre core.
Before the LUSTRE_MGS_HOST / LUSTRE_MGS_NET fields are filled, check that the host node is valid by running the command:
gethostip -dn <host_name>
This will list the host name and its IP address. This is particularly recommended when there are multiple interfaces for a node.
- LUSTRE_MGS_HOST: name of the Management Node where the MGS service is installed. This value is used by the lustre_util tool to link the MGS with the other Lustre entities, for example MDS, OSS.
- LUSTRE_MGS_NET: the name of the network used to reach the MGS, for example TCP or o2ib. When the o2ib net type is used, the LUSTRE_MGS_HOST name value has to be suffixed with -ic0, which is the hostname suffix for IB networks.
For example, if you need to use an InfiniBand network to reach the MGS entity that runs on the node zeus6, you have to:
- set LUSTRE_MGS_NET to o2ib
- set LUSTRE_MGS_HOST to zeus6-ic0
- LUSTRE_MGS_ABSOLUTE_LOOPBACK_FILENAME: file for the mgs loop device. The default is /home/lustre/run/mgs.loop. When High Availability exists for the Management Node, select a directory which is shared for the Management Node pairs. This value is used by the MGS serv
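As an illustration of the example just given, the corresponding lines in /etc/lustre/lustre.cfg would look like the following. This assumes the usual name=value syntax of the lustre.cfg file; adapt the node name and the loopback file location (a shared directory in the High Availability case) to your cluster.

LUSTRE_MGS_NET=o2ib
LUSTRE_MGS_HOST=zeus6-ic0
LUSTRE_MGS_ABSOLUTE_LOOPBACK_FILENAME=/home/lustre/run/mgs.loop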
…is 123456.
To change to Privileged mode:
1. Once in admin mode, enter: enable
2. Enter the following password at the prompt: voltaire

8.2.3 Starting a CLI Management Session via Telnet
1. Establish a Telnet session with the Voltaire device.
2. At the Login prompt, type the user name: admin
3. At the Password prompt, type the default password: 123456
To change to Privileged mode:
4. Once in admin mode, enter: enable
5. Enter the following password at the prompt: voltaire
6. Enter the appropriate CLI commands to complete the required actions.

8.2.4 Configuring the Time and Date
Use the command sequence below to configure the time and date parameters for the switch. The time and date will appear on event reports that are time-stamped.
1. Enter Privileged mode from Exec mode:
enable
password:
2. Set the time and date. For example, time 8:22 AM, date June 21 2008:
clock set 062108222008

8.2.5 Hostname setup
8.2.5.1 Names configuration menu
Enter the switch name configuration menu as follows:
ssh enable@<switchname>
enable@<switchname>'s password: voltaire
Welcome to Voltaire Switch <switchname>
Connecting ...
<switchname># config
<switchname>(config)# names
<switchname>(config-names)#
8.2.5.2 Setting up the system name
The switch name can be set as follows:
<switchname>(config-names)# system name set <switch h
lustre_util install -f /etc/lustre/models/fs1.lmf -V
This operation is quite long, as it formats the underlying file system (about 15 minutes for a 1 TB file system). Do not use the -V option if a less verbose output is required.
At the top of the checking terminal the following should appear:
Filesystem fs1: Cfg status: formating   Status: offline   Mounted 0 times
Filesystem fs1: Cfg status: installed   Status: offline   Mounted 0 times
The last line printed at the execution terminal must be:
Filesystem fs1 SUCCESSFULLY installed
6. Enable the file system by running the following command:
lustre_util start -f fs1 -V
This operation is quite long (about 10 minutes for a 1 TB file system). Do not use the -V option if a less verbose output is required.
At the top of the checking terminal the following should appear:
Filesystem fs1: Cfg status: installed   Status: starting   Mounted 0 times
Filesystem fs1: Cfg status: installed   Status: online     Mounted 0 times
The running status of the OSTs/MDT must also be online.
The last lines printed at the execution terminal must be:
FILESYSTEMS STATUS
filesystem   config status   running status   number of clts   migration
fs1          installed       online           0                0 OSTs migrated
7. Mount the file system on clients. Run the following command:
lustre_util mount -f fs1 -n <list of client nodes> using
…manager and must only be installed on clusters which do not use PBS Professional.
The SLURM files are installed under the /usr and /etc directories.

Note: This step applies to the Management Node only. The same configuration file will be copied later to the other nodes in the cluster (see STEP 5).

3.3.11.1 Install the SLURM RPMs
Run the command below to install the SLURM RPMs:
yum install slurm pam_slurm slurm-munge slurm-auth-none slurm-devel

Note: munge and munge-libs are included within the slurm-munge RPM and will not need to be installed separately.

See: The Software Release Bulletin for BAS5 for Xeon v1.2 for details on how to install SLURM version 1.0.15. This version is required to ensure compatibility with the LSF Batch Manager.

3.3.11.2 Create and Modify the SLURM configuration file
A SLURM configuration file must be created using the parameters that describe the cluster. The /etc/slurm/slurm.conf.example file can be used as a template to create the /etc/slurm/slurm.conf file for the cluster.
The slurm.conf file can be created manually from the template described above, OR the tool found at /usr/share/doc/slurm-1.3.2/html/configurator.html can be used to help define the necessary parameters. This tool is an HTML file that, when loaded into a browser (e.g. Firefox), will generate a slurm.conf file in text format using the parameters supplied by the user. The generated file can be saved o
…network layers that can be activated in the kernel, for example InfiniBand or Ethernet.

Important: By default, the Lustre model file delivered is set to the elan nettype. The nettype parameter in the /etc/lustre/models/fs1.lmf file must be changed to o2ib for InfiniBand networks and to tcp for Ethernet networks.

If Ethernet is used as the Lustre network layer and there are several physical links, you must select the links to be used by Lustre. This is done by editing the /etc/modprobe.d/lustre file (an illustrative line is shown a little further below).

See: The Lustre Operations Manual from CFS (Section Multihomed Servers, sub-section modprobe.conf), available from http://manual.lustre.org, for more details.

3. Set the /etc/lustre/lustre.cfg file:
a. Edit the /etc/lustre/lustre.cfg file of the Management Node.
b. Set LUSTRE_MODE to XML. This should already have been done.
c. Set CLUSTERDB to yes, if not already done.

Lustre HA: Carry out the actions indicated in the Installing the Lustre LDAP Directory and the Cluster DB Synchronisation using lustredbd sections, in the Configuring High Availability for Lustre chapter, in the BAS5 for Xeon High Availability Guide.

6.3.6 Configuring the Lustre MGS service
The Lustre MGS service must be installed and configured on the Management Node before Lustre is installed. The Lustre MGS service is not managed by the lustre_util tool. It is an
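As flagged above, selecting the links used by Lustre networking is done with an options line in /etc/modprobe.d/lustre. The lines below are illustrative only: the interface names are placeholders and the authoritative syntax is the one given in the Lustre Operations Manual referenced above.

# restrict Lustre to one Ethernet link
options lnet networks=tcp0(eth1)
# or, for an InfiniBand interconnect
options lnet networks=o2ib0(ib0)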
…on the Cluster
Start the deployment by running the command:
ksis deploy <image name> <nodes>
If, for example, 3 Compute Nodes are listed as ns[2-4], then enter the following command for the deployment:
ksis deploy image1 ns[2-4]

Note: Reference nodes may be kept as reference nodes and not included in the deployment. Alternatively, the image may be deployed on to them so that they are included in the cluster. It is recommended that this second option is chosen.

2.7.4 Post Deployment Configuration
2.7.4.1 Edit the postconfig script
Before running the postconfig command in the section below, the postconfig script will need editing as follows:
1. Run the command below to disable the configuration of the interconnect interface:
ksis postconfig disable CONF_60_IPOIB
2. Recompile the postconfig script by running the command below:
ksis postconfig buildconf

2.7.4.2 postconfig command
Once the image deployment has finished, the cluster nodes will need to be configured according to their type (Compute, I/O, etc.). Post-deployment configuration is mandatory, as it configures Ganglia, Syslog-ng, NTP, SNMP and Pdsh on the nodes. The Ksis postconfig command configures each node that the image has been deployed to in the same way, ensuring that all the cluster nodes of a particular type are homogenous. Ksis post-configuration is
66. pdsh syntax For example if the client nodes are 50 and ns2 then run lustre util mount f 51 n ns 0 2 At the top of the checking terminal the following should appear Filesystem 51 Cfg status installed Status online Mounted 2 times The last line printed at the execution terminal must be Mounting filesystem fsl succeeds on ns 0 2 The file system is now available As administrator it will be possible to create user directories and to set access rights It is possible to check the health of the file system at any time by running lustre_util status 6 14 55 for Xeon Installation and Configuration Guide This will display a status as below FILESYSTEMS STATUS filesystem config running number migration status status of clts fsl installed online 2 0 OSTs migrated CLIENTS STATUS filesystem correctly mounted If more details are required then run lustre util all info f all The file system health can also be checked in the Nagios view of the Management Node Configuring File Systems 6 15 6 16 5 for Xeon Installation and Configuration Guide Chapter 7 Installing Intel Tools and Applications 7 1 7 2 7 2 1 7 2 2 This chapter describes how to install tools or commercial software from CDs or supplier sites Intel Libraries Delivered Some applications delivered with the Bull XHPC CD ROM have b
67. the Upgrade an existing installation option will not be in place 3 14 Disk partitioning There are different disk partitioning options available according to whether you are installing for the first time and using the default partitioning provided by LVM OR are carrying out a reinstallation and wish to use the partitioning that already exists Installing BAS5 for Xeon v1 2 Software on the HPC Nodes 3 9 3 1 4 1 3 10 Note Default partitioning RED HAT ENTERPRISE LINUX 5 Installation requires partitioning of your hard drive By default a partitioning layout is chosen which is reasonable for most users You can either choose to use this or create your own a Remove linux partitions on selected drives and create default layout gt Select the drive s to use for this installation Advanced storage configuration Review and modify partitioning layout jBetease Notes Figure 3 6 Partitioning screen 5 The default disk partitioning screen will appear as shown above Usually all the default options can be left as shown above as the partitioning will be handled automatically by Logical Volume Manager LVM Click on Next If there is more than one disk for the Management Node they will all appear checked in the drive list in Figure 3 6 and will be reformatted and the Red Hat software installed on them Deselect those disks where you
68. the basic functions of the slurm setup sh script Notes e slurm conf file must have been created on the Management Node and all the necessary parameters defined BEFORE the script is used to propagate the information to the Reference Nodes e use of the script requires root access and depends on the use of the ssh pdcp and pdsh tools Running the slurm_setup sh script As the root user on the Management Node execute the script supplying the names of the LOGIN and COMPUTE X Reference Nodes to be configured for example etc slurm slurm setup sh N login0O compute0 computex0 The script will run and in the process read the slurm conf file copy it and other required files to the Reference Nodes create the SlurmUser create the job credential keys and create the log files as needed Additional slurm_setup sh script options The following additional options are available for greater control of the slurm_setup sh script or for debugging purposes etc slurm slurm_setup sh N lt reference node list gt lt slurm user password gt b lt slurm base pathname gt v LF uid gid Installing BASS for Xeon v1 2 Software on the HPC Nodes 3 53 3 5 4 4 3 54 Parameters N Comma separated list of reference nodes not including the node on which the script was invoked After running the script on the local node the script and other files will be copied to the Refer
…the corresponding fields in the /etc/storageadmin/nec_admin.conf file (login, password).
c. Then restart the iSM manager service:
/etc/init.d/iSMsvr restart
3. FDA CLI Configuration
a. Copy the /etc/iSMSMC/iSMSM.sample file into the /etc/iSMSM/iSMSM.conf file.
b. Restart the CLI manager service:
/etc/init.d/iSMSMC restart

Enabling ssh access from the Management Node on a Linux System

Note: This part of the process is only required when the FDA software is installed on a system other than the Management Node. There is no need to enable ssh access if the NEC software is located locally on the Management Node; if this is the case, skip this paragraph.

ssh is used by the management application to monitor the FDA storage systems. ssh must be enabled so that the FDA management tools operate correctly on the cluster Management Node.
Distribute RSA keys to enable password-less connections from the cluster Management Node (a sketch of these steps is given at the end of this section):
1. Log on as root on the cluster Management Node and generate asymmetric RSA keys.
2. Go to the directory where the RSA keys are stored. Usually it is ~/.ssh. You should find the id_rsa and id_rsa.pub files. The .pub file must be appended to the authorized_keys file on the Linux FDA manager system. The authorized_keys file defined in the /etc/ssh/sshd_config file (by default ~/.ssh/authorized_keys) must be used.
3. If no k
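A sketch of steps 1 and 2 above, assuming the default key location and that root access to the FDA manager host is available; the host name is a placeholder.

# on the cluster Management Node
ssh-keygen -t rsa                      # accept the default ~/.ssh/id_rsa location
cat ~/.ssh/id_rsa.pub | ssh root@<fda_manager_host> "cat >> ~/.ssh/authorized_keys"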
70. the storage system Installing BASS for Xeon v1 2 Software on the HPC Nodes 3 19 3 2 STEP 2 Installing BAS5 for Xeon software on the Management Node A BR CO This step describes how to install the Bull BASS for Xeon v1 2 software on the Management Node s It includes the following sub tasks 1 Preparation for the Installation of the BASS for Xeon v1 2 XHPC software Preparation for the Installation of the BASS for Xeon v1 2 optional software Installation of Bull BAS5 for Xeon v1 2 software Configuration of the Database Preparation for the Installation of the Red Hat software on other cluster nodes To identify the CD ROM mount points look at etc fstab file USB CD ROMs look like dev scd media IDE CD ROMs look like dev hd media Note The examples in this section assume that media cdrecorder is the mount point for the CD ROM During the installation procedure for Red Hat Enterprise Linux Server 5 some software packages are loaded that are specifically required for Bull BASS for Xeon clusters The following section describes the installation of these packages along with the Bull XHPC and optional InfiniBand XLustre and XToolkit software 3 2 1 Preparing the Installation of the Red Hat software 1 Create the directory for the software mkdir p release RHEL5 1 2 Create a mount point for the RHEL5 1 DVD by running the command below mkdir p m
71. the virtual interface ve 1 interface configuration mode myswitch config vlan 1 finterface ve 1 myswitch config vif 1 Assign an IP address to the virtual interface ve 1 ip address ip a b c d gt lt netmask a b c d gt myswitch config vif 1 ip address 10 0 0 254 255 0 0 0 Exitthe interface configuration myswitch config vif 1 exit myswitch config 5 The portfast mode for the spanning tree is the default mode myswitch config fast port span 6 Seta password for the enable mode For example myswitch config enable password myswitch 7 Enable the telnet connections and set a password myswitch config enable telnet password admin 8 Exit the configuration myswitch config exit 9 Save the configuration in RAM myswitch write memory Configuring Switches and Cards 9 13 9 14 10 Update the switch boot file on the Management Node 11 Run the following commands from the Management Node console touch tftpboot switch configure file chmod ugotw tftpboot switch configure file Note switch configure file name must include the switch name followed by confg for example myswitch confg 12 Save and exit the switch configuration from the switch prompt myswitch copy running tftp tftp server switch configure file myswitch exit Indicate the IP address of the Service Node for the tftp server this is generally t
72. to do this 3 14 5 for Xeon Installation and Configuration Guide RED HAT ENTERPRISE LINUX 5 The root account is used for administering the system Enter a password for the root user Root Password Release Notes Figure 3 14 Root Password Screen 8 Set the Root password as shown in Figure 3 13 This must use a minimum of 6 characters 2 LZ Red Hat Enterprise Linux 5 Package Installation RED HAT ENTERPRISE LINUX 5 The default installation of Red Hat Enterprise Linux Server includes a set of software applicable for general internet usage What additional tasks would you like your system to include support for Software Development Web server You can further customize the software selection now or after install via the software management application Customize later Customize now Release Notes Figure 3 15 Software selection screen Installing BASS for Xeon v1 2 Software on the HPC Nodes 3 15 9 Leave the screen with the additional tasks deselected as shown in Figure 3 14 Click on Next RED HAT ENTERPRISE LINUX 5 Click next to begin installation of Red Hat Enterprise Linux Server A complete log of the installation can be found in the file root install log after rebooting your system A kickstart file containing the installation options selected can be found in the file root anaconda ks cfg afte
73. version 10 compilers Follow the instructions written in the Bull notice supplied with the compiler 3 5 8 Configuring the MPI User Environment See Section 3 3 15 in this chapter for details 3 58 55 for Xeon Installation and Configuration Guide 3 5 9 Bull Scientific Studio The Bull Scientific Studio RPMs are installed automatically on the COMPUTE X LOGIN reference nodes See The BASS for Xeon User s Guide and System Release Bulletin for more information on the libraries included in Scientific Studio 3 5 10 NVIDIA Tesla Graphic Card accelerators optional The drivers for both the NVIDIA Tesla C1060 card and for the NVIDIA Tesla 1070 accelerator are installed automatically on the COMPUTE X LOGIN reference nodes NJ rportoni The NVIDIA Tesla C1060 card is used on NovaScale R425 servers only and the NVIDIA Tesla 1070 accelerator can be used by both NovaScale R422 E1 and R425 servers 31 NVIDIA CUDA Toolkit optional The NVIDIA CUDA Toolkit and Software Development Kit are installed automatically on the LOGIN COMPUTE and COMPUTEX reference nodes for clusters which include Tesla graphic accelerators so that the NVIDIA compilers and the NVIDIA mathematical and scientific libraries are in place for the application Configuring NVIDIA CUDA Toolkit The PATH and LD LIBRARY PATH environmental variables should be modified to give access to the directories where the CUDA Toolki
74. 0 Different installation options are possible e Red Hat Enterprise Linux Server 5 distribution all clusters e Bull BASS for Xeon distribution all clusters e Bull HPC Toolkit monitoring tools all clusters e Bull XIB software for clusters which use InfiniBand interconnects e Bull XLustre software for clusters which use the Lustre Parallel file system In addition there are two installation possibilities for the Compute Nodes These are e Minimal Compute or COMPUTE Node which includes minimal functionality and is quicker and easier to deploy e Extended Compute or COMPUTEX Node which includes additional libraries and will take longer to deploy These nodes are used for most ISV applications and for applications that require a graphical environment X Windows They are also installed if there is a need for Intel Cluster Ready compliance o This chapter describes 5 for Xeon v1 2 installation process for clusters without any form of High Availability in place For clusters which include some form of High Availability this manual must be used in conjunction with the BAS5 for Xeon High Availability Guide For example if your cluster includes High Availability for the Lustre file system refer to the chapter in the High Availability Guide which refers to the configuration of High Availability for Lustre as well as this chapter See The Software Release Bulletin delivered with your BASS
75. 1 6 Operations Manual available from http www lustre org This chapter assumes an existing BAS4 for Xeon cluster has migrated to 5 for Xeon without the XLustre 1 6 x RPMS being installed when the system was migrated Lustre has to be migrated from version 1 4 x to version 1 6 x on all clusters which install BASS for Xeon All data stored in the Lustre file systems should be backed up before Lustre is migrated WARNING The Lustre 1 6 Operations Manual states that a rolling upgrade is possible meaning that the file system is not taken out of commission for the migration However Bull only supports a Lustre migration which has been carried out on a system which has been completely stopped This ensures that the migration will be risk free and is simpler to carry out C 1 Migrating Lustre from version 1 4 to version 1 6 C 1 1 Pre Configuration for Migration 1 Disable High Availability for Lustre if it is in place For all Lustre file systems run the command lustre ldap unactive f fsname After running these commands it is strongly recommended to wait for 3 minutes This corresponds to the default duration for the Lustre HA timeout feature and will ensure that the commands are taken into account correctly 2 Stop all the file systems from the Management Node lustre util umount f all n all lustre util stop f all 3 Make a backup copy of the etc lustre direct
76. 10c model has 2 GB cache memory 4 x 4 Gb s FC front end ports and 2 x 4 Gb s FC back end disk ports and supports full SATA high capacity disk drive configuration 60 drives The AX4 5 model is a costeffective solution delivering performance scalability and advanced data management features It comes with Navisphere Express which simplifies installation configuration and operation It offers RAID protection levels 1 0 3 and 5 It has 2 GB cache memory 4 x 4 Gb s FC front end ports and 2 x 3 Gb s SAS back end expansion ports It supports up to 60 SATA or SAS mixed drives Note The EMC CLARiiON CX300 model is supported on older systems DDN S2A 9550 Storage systems The S2A9550 Storage Appliance is specifically designed for high performance high capacity network storage applications Delivering up to 3 GB s large file performance from a single appliance and scaling to 960 TBs in a single storage system Cluster Configuration 1 13 1 3 LA L3 lal 1 3 1 2 1 14 Software Environment Main Console and Hardware Management System Console The Management Node uses management software tools to control and run the cluster These tools are used for e Power ON Power OFF Force Power Off e Checking and monitoring the hardware configuration e Serial over LAN The IPMI protocol is used to access the Baseboard Management Controllers BMC which monitor the hardware sensors for temperature cooling fan speeds power m
8.2.7 Setting up the switch IP address ... 8-4
8.2.8 (…) ... 8-4
8.2.9 Routing Algorithms ... 8-5
8.2.10 Subnet manager (SM) ... 8-5
8.2.11 Configuring Passwords ... 8-6
8.3 Configuring Voltaire switches according to the (…) ... 8-7
8.3.1 Setting the Topology (CLOS stage) ... 8-7
8.3.2 Determining the node GUIDs ... 8-8
8.3.3 Adding new (…) ... 8-9
8.4 Performance manager ... 8-11
8.4.1 Performance manager (…) ... 8-12
8.4.2 Activating the performance manager ... 8-12
8.5 FTP ... 8-13
8.5.1 FTP configuration ... 8-13
8.5.2 Setting Up (…) ... 8-13
8.6 (…) ... 8-13
8.6.1 (…) Configuration Menu ... 8-14
8.6.2 Generating group.csv ... 8-14
8.6.3 Importing a new group.csv file on a switch running Voltaire firmware (…) ... 8-14
8.6.4 Importing a new group.csv file on a switch running Voltaire 4.X
78. 3 Run the following command on the Management Node to ensure the Compute Node is visible for the PBS server qmgr c create node compute node name 4 Modify the initial script by removing the s P options from the options to pbs attach line line 177 This should appear as below Installing BASS for Xeon v1 2 Software on the HPC Nodes 3 57 vi usr pbs lib MPI pbsrun mpich2 init in options to pbs attach j PBS JOBID 5 Add the MPIBull2 wrapper by using the command below usr pbs bin pbsrun wrap opt mpi mpibull2 xxx bin mpirun pbsrun mpich2 3 5 5 4 Initial configuration on the LOGIN Reference Node Modify the etc pbs conf file for the node as follows PBS_EXEC usr pbs PBS_HOME var spool PBS PBS_START_SERVER 0 PBS_START_MOM 0 PBS START SCHED 0 PBS_SERVER lt server_name gt 0 PBS_SCP usr bin scp a portant The Login Nodes have to be defined on the Management Node by creating a etc hosts equiv file containing the Login Node names one per line 3 5 6 Installing Compilers Install the Intel Compilers on the LOGIN Reference Nodes if required Follow the instructions written in the Bull notice supplied with the compiler du Intel Math Kernel Library MKL Install the Intel MKL libraries on the Compute Extended Compute and Login Reference Nodes if required Intel MKL is included with the Professional Editions of Intel
Keyboard installation screen ... 3-7
RHEL5 installation number dialog ... 3-7
Skip screen for the installation (…) ... 3-8
First RHEL5 installation screen ... 3-9
(…) ... 3-10
Confirmation of removal of any existing partitions ... 3-11
Modifying the partitioning layout (1st screen) ... 3-11
Confirmation to remove existing partitions ... 3-12
Partitioning options screen ... 3-12
Confirmation of previous partitioning settings ... 3-13
Network Configuration Screen ... 3-13
(…) ... 3-14
Root Password Screen ... 3-15
Software selection screen ... 3-15
(…) screen ... 3-16
Launching NovaScale Master ... 3-67
NovaScale Master Welcome screen ... 3-67
NovaScale Master Authentication Window ... 3-68
The NovaScale Master console ... 3-68
NovaScale Master Monitoring Window
80. 5 95 1 2 5 2 storage vendor name list of storage system names to which the model is applicable Vendorspecific information cache configuration watermarks etc Declaration of RAID groups grouping disks in pools Declaration of spare disks Declaration of LUNs Declaration of LUN access control groups and mappings of internal external LUN numbers LUSTRE specific declarations for storage systems which use the LUSTRE global file system deployment Note With some versions of Fibre Channel adapter node drivers the correct detection of the LUNs for a storage device port is dependent on the accessibility of a LUN numbered It is recommended the Access Control groups for a storage device are configured so that the list of LUNs declared for each group always includes an external LUN that is numbered O A model file is created by manual by editing the file and its syntax is checked when the model is deployed to the storage systems Although there is no constraint about the location of storage model files a good practice is to store them in etc storageadmin directory of the Management Node 9 The Administrator should backup storage model files as model files may be reused later to reinstall a particular configuration Automatic Configuration of a Storage System The automatic configuration of storage system using a model file requires that the storage d
[Screen output: list of the physical drives detected by the controller, including the ADAPTEC Virtual SGPIO device]
Figure G-16. An example of a drive list for an Adaptec controller
16. If everything is OK, press Escape several times to go back until the Exit Utility menu appears, as shown below.
[Screen: <<< Adaptec RAID Configuration Utility >>> exit menu]
[Screen: RAID Configuration Utility options menu listing Array Configuration Utility, SerialSelect Utility and Disk Utilities. Arrow keys to move cursor, <Enter> to select option, <Esc> to exit (default)]
Figure G-2. RAID Configuration Utility Options menu: Array Configuration Utility
3. Select Manage Arrays from the Array Configuration Utility Main Menu, as shown below.
Figure G-3. Array Configuration Utility Main Menu
The List of Arrays already installed will be displayed. Select an array from the list to see its properties, as shown in the examples below.
[Screen: properties of Array 0, a RAID 5 array: size, stripe size 512, array status, and the list of FUJITSU member drives]
Figure G-4. Example of Array Properties for a RAID 5 Array
83. Bull BASS for Xeon systems Prerequisites Refer to the BASS for Xeon v1 2 Software Release Bulletin SRB for details of any restrictions which apply to your release Use this manual in conjunction with the BASS for Xeon High Availability Guide if your cluster includes any form of High Availability Structure This manual is organised as follows Chapter 1 Cluster Configuration Explains the basics of High Performance Computing in a LINUX environment It also provides general information about the hardware and software configuration of a Bull BASS for Xeon HPC system Chapter 2 Updating BASS for Xeon v1 1 clusters to BASS for Xeon v1 2 Describes how to update existing BASS for Xeon v1 1 clusters to BAS5 for Xeon v1 2 Chapter 3 Installing BAS5 for XEON Software on HPC Nodes Details the software installation processes possible for the different types of cluster nodes Chapter 4 Configuring Storage Management Services Describes how to configure the storage management software to manage the storage systems of the cluster Chapter 5 Configuring I O Resources for the Cluster Describes the use of storage model configuration files Chapter 6 Configuring File Systems Describes how to configure NIS on the Login and Compute Nodes setting NFSv3 file systems and configuring the Lustre Parallel File System Chapter 7 Installing Tools and Applications Describes how to install commercial tools Intel Compilers and MKL and other
84. Ethernet0 24 interface Vlanl ip address 10 0 0 254 255 0 0 0 no ip route cache ip http server logging history warnings logging trap warnings logging facility 1 10 snmp server community public RO control plane line con 0 password admin login line vty 0 4 password admin login line vty 5 15 password admin login 132 Configure a Foundry Networks Switch The following procedure works for the Fastlron and Biglron models 1 Set the enable mode FLS648 Switch gt enable No password has been assigned yet FLS648 Switch 2 Enter the configuration mode LS648 Switch configure terminal LS648 Switch config 3 Set the name of the switch in the form hostname switch name For example FLS648 Switch config hostname myswitch myswitch config 9 12 55 for Xeon Installation and Configuration Guide 4 Assign a management IP address in the form a on Fastlron FLS624 or FLS648 models Assign IP address to the switch ip address ip a b c d netmask a b c d myswitch config ip address 10 0 0 254 255 0 0 0 myswitch config b on Biglron RX4 RX8 and RX16 models Enter the Vlan 1 interface configuration mode myswitch config vlan 1 myswitch config vlan 1 Set the corresponding virtual interface this allows the management IP address to be configured myswitch config vlan 1 router interface ve 1 Enter
85. GB COMPUTE node with 48GB nova4 Deploy Deploy group nova3 HwRepair HwRepair group nova8 IO odes by type IO nova 6 10 ETA odes by type META nova 5 9 YFAME nsemble des fame du cluster 0 4 6 8 10 NODES128GB odes by memory size 128GB nova8 NODES16GB odes by memory size 16GB nova 1 3 7 NODES48GB odes by memory size 48GB nova 4 6 10 ODES64GB odes by memory size 64GB nova 0 5 9 12 OxTest OxTest group nova 0 6 TEST TEST group nova 5 9 UnitTest UnitTest group nova 1 9 3 Runa test command for a group of nodes as shown below pdsh g IO date dshbak c 4 If pdsh is functioning correctly this will give output similar to that below nova 6 10 Installing BAS5 for Xeon v1 2 Software on the HPC Nodes 3 65 Thu Aug 7 15 35 27 CEST 2008 37 2 Checking NTP 1 Run the following command on a COMUTE X node and on a combined LOGIN IO Login or dedicated LOGIN nodes ntpq p Check that the output returns the name of the NTP server and that values are set for the delay and offset parameters 2 Onthe Management Node start ntptrace and check if the Management Node responds ntptrace 172 17 0 99 ns0 stratum 11 offset 0 000000 synch distance 0 012695 3 From the Management Node check that the node clocks are identical pdsh w ns 0 1 dat ns0 Tue Aug 30 16 03 12 CEST 2005 nsl Tue Aug 30 16 03 12 CEST 2005 2 7 3 Checking Syslog ng 1 Check on the
…GBs, or unlimited.

Note: It is mandatory to restart the sshd daemons after changing these limits.

3.4 STEP 4: Installing RHEL5.1, BAS5v1.2 for Xeon Software and optional HPC software products on other nodes
The Management Node has to be configured to be the NFS server that will install the Red Hat Linux distribution and the Bull BAS5 for Xeon HPC software on all the other nodes of the cluster. Once the NFS environment has been correctly set, all that is required is that the individual nodes are booted for the Linux distribution to be installed on them.
Only one node of each type has to be installed, as KSIS will be used for deployment; for example, install a single COMPUTE or COMPUTEX Node and then deploy it, and/or install a single IO/LOGIN Node and then deploy it (see STEP 6).
Before running the preparenfs script, the prerequisites below must be satisfied.

Note: If the steps in the previous section have been followed correctly, these prerequisites will already be in place.

3.4.1 Preparenfs script prerequisites
• The node(s) that are to be installed must have been configured in the dhcpd.conf file, in order that an IP address is obtained on DHCP request.
• The option next-server and the option filename for each host have to be set correctly (a sketch of such an entry is shown after this list).
• The DHCPD service must be running; if not, the script will try to start it.
• The XINE
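The first two prerequisites translate into one host block per node in the dhcpd configuration. The entry below is only a sketch: the node name, MAC address and IP addresses are placeholders, and the boot file name (pxelinux.0 is shown as a common choice) must match what the preparenfs script expects on your cluster.

host ns1 {
   hardware ethernet 00:30:13:aa:bb:01;
   fixed-address 172.17.0.1;
   next-server 172.17.0.99;      # the Management Node (NFS/tftp server)
   filename "pxelinux.0";
}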
87. HPC BAS for Xeon Installation and Configuration Guide REFERENCE 86 A2 87EW 01 HPC BASS for Xeon Installation and Configuration Guide Hardware and Software October 2008 BULL CEDOC 357 AVENUE PATTON B P 20845 49008 ANGERS CEDEX 01 FRANCE REFERENCE 86 A2 87EW 01 The following copyright notice protects this book under Copyright laws which prohibit such actions as but not limited to copying distributing modifying and making derivative works Copyright Bull SAS 2008 Printed in France Trademarks and Acknowledgements We acknowledge the rights of the proprietors of the trademarks mentioned in this manual All brand names and software and hardware product names are subject to trademark and or patent protection Quoting of brand and product names is for information purposes only and does not represent trademark misuse The information in this document is subject to change without notice Bull will not be liable for errors contained herein or for incidental or consequential damages in connection with the use of this material Preface Scope and Objectives This guide describes how to install or re install the Bull HPC 5 for Xeon v1 2 Bull Advanced Server software distribution and all other associated software on Bull High Performance Computing clusters It also describes the configuration tasks necessary to make the cluster operational Intended Readers This guide is for Administrators of
JobCompType=jobcomp/filetxt                       # default is jobcomp/none
JobCompLoc=/var/log/slurm/slurm.job.log
SwitchType=switch/none
ProctrackType=proctrack/pgid
# valid below for SLURM v1.0.15
JobAcctType=jobacct/linux                         # default is jobacct/none
JobAcctLogFile=/var/log/slurm/slurm_acct.log
# valid below for SLURM v1.3.2
JobAcctGatherType=jobacct/linux                   # default is jobacct/none
AccountingStorageLoc=/var/log/slurm/slurm_acct.log
FastSchedule=1                                    # default is 1
FirstJobId=1000                                   # default is 1
ReturnToService=1                                 # default is 0
MpiDefault=none                                   # default is none
SlurmEventHandler=/usr/lib/clustmngt/slurm/slurmevent
JobCredentialPrivateKey=/etc/slurm/private.key
JobCredentialPublicCertificate=/etc/slurm/public.key
# NODE CONFIGURATION
NodeName=bali[10-37] Procs=8 State=UNKNOWN
# PARTITION CONFIGURATION
PartitionName=global Nodes=bali[10-37] State=UP Default=YES
PartitionName=test Nodes=bali[10-20] State=UP MaxTime=UNLIMITED
PartitionName=debug Nodes=bali[21-30] State=UP

Final Configuration Steps
After the SLURM RPMs have been installed, and all the necessary parameters for the cluster have been defined in the slurm.conf file, a few steps still remain before the configuration of SLURM is complete on the Management Node. These steps can either be done using the slurm_setup.sh script (see section 3.3.11.4) OR manually (see section 3.3.11.5).

Using the slurm_setup.sh Script
The SLURM setup script is found in /etc/slurm/slurm
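For the manual route mentioned above, the job credential key pair referenced by JobCredentialPrivateKey and JobCredentialPublicCertificate is commonly generated with openssl. This is a sketch, not necessarily the exact procedure used by the slurm_setup.sh script; the key length and file locations should match your slurm.conf.

openssl genrsa -out /etc/slurm/private.key 1024
openssl rsa -in /etc/slurm/private.key -pubout -out /etc/slurm/public.key
chmod 600 /etc/slurm/private.key       # restrict access to the private key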
89. O services Management Node Services The Management Node is dedicated to providing services and to running the cluster management software All management and monitoring functions are concentrated on this one node For example the following services may be included NTP Cluster DataBase Kerberos snmptrapd ganglia dhcpd httpd conman etc The Management Node can also be configured as a gateway for the cluster You will need to connect it to the external LAN and also to the management LAN using two different Ethernet cards A monitor keyboard and mouse will also need to be connected to the Management Node The Management Node houses a lot of reference and operational data which can then be used by the Resource Manager and other administration tools It is recommended to store data on an external RAID storage system The storage system should be configured BEFORE the creation of the file system for the management data stored on the Management node Login Node Services Login Node s are used by cluster users to access the software development and run time environment Specifically they are used to Login Develop edit and compile programs Debug parallel code programs 1 0 Node Services O Nodes provide a shared storage area to be used by the Compute Node when carrying out computations Either the NFS or the Lustre parallel file systems may be used to carry out the Input Output operations for BAS5 for Xeon clusters
90. Professional Editions of Intel version 10 compilers. Follow the instructions written in the Bull notice supplied with the compiler.

Configuring the MPI User environment

MPIBull2 comes with different communication drivers and with different process manager communication protocols. When using the InfiniBand OFED / SLURM pairing, the System Administrator has to verify that users are able to find the OFED libraries required. User jobs can be linked with the SLURM PMI library and then launched using the SLURM process manager.

The MPIBull2 RPMs include two automatic setup files (/opt/mpi/modulefiles/mpiBull2 and mpiBull2.sh), which are used to define the default settings for the cluster.

User access to MPIBull2

The administrator has a choice of 3 different ways of making MPIBull2 available to all users:

1. Copying the mpibull2 environment initialization shell scripts from /opt/mpi/mpibull2-<version>/share to the /etc/profile.d directory, according to the environment required. For example:

   For MPI:
   cp /opt/mpi/mpibull2-1.2.1-4/share/mpibull2.sh /etc/profile.d/

   For Intel C:
   cp /opt/intel/cce/<compiler_version>/bin/iccvars.sh /etc/profile.d/

   For Intel Fortran:
   cp /opt/intel/fce/<compiler_version>/bin/ifortvars.sh /etc/profile.d/

2. Using the module command with the profile files to load the MPIBull2 module for the end users:

   module load <your_mpi_version>
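For illustration of what an end user then does, the hedged session below assumes the module method was chosen; the module name, source file and process count are placeholders:

   module load mpibull2/1.2.1-4   # illustrative module name; check 'module avail' for the real one
   mpicc -o hello hello.c         # compile with the MPIBull2 compiler wrapper
   srun -n 4 ./hello              # launch through the SLURM process manager (PMI)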
91. Storage Management Services This chapter describes how to e Configure the storage management software installed on the Management Node e Initialize the management path to manage the storage systems of the cluster e Register detailed information about each storage system in the ClusterDB The following topics are described 4 1 Enabling Storage Management Services 4 2 Enabling FDA Storage System Management 4 3 Enabling DataDirect Networks DDN S2A Storage Systems Management 4 4 Enabling the Administration of an Optima 1250 Storage System 4 5 Enabling the Administration of EMC Clariion DGC storage system 4 6 Updating the ClusterDB with Storage Systems Information 4 7 Storage Management Services 4 8 Enabling Brocade Fibre Channel Switches Note When installing the storageadmin xxx rpms in update mode rpm U all the configuration files described in this section and located in etc storageadmin are not replaced by the new files Instead the new files are installed and suffixed by rpmnew Thus the administrators can manually check the differences and update the files if necessary See For more information about setting up the storage management services refer to the Storage Devices Management chapter in the Bull BAS5 for Xeon Administrator s Guide Unless specified all the operations described in this section must be performed on the cluster management station using the root account Configuration of Storage Ma
92. - The BMCs of the nodes must already be configured.

3.4.2 Preparing the NFS node software installation

Run the preparenfs command:

   preparenfs

Use the verbose option for a more detailed trace of the execution of the preparenfs script to be stored in the preparenfs log file:

   preparenfs -verbose

Use the interactive option to force the installation to run in interactive mode. All the Linux installation steps will be pre-filled and will have to be confirmed or changed:

   preparenfs -interactive

The script will ask for the following information:

1. The path containing the operating system you want to use to prepare the PXE boot, for example /release/RHEL5.1. In the example below, number 2 would be entered from the options displayed:

   The following Operating System(s) have been found in the /release directory:
   0 - Choose Custom PATH
   1 - Red Hat Enterprise Linux Server 5 (release TEST2)
   2 - Red Hat Enterprise Linux Server 5 (release RHEL5.1)
   3 - Red Hat Enterprise Linux Server 5 (release TEST1)
   Select the line for the Operating System you want to use for the installation.

2. The partitioning method you want to use for the installation: manual (user-defined partitioning; you will be asked to define the partitions interactively) or auto (a predefined kickstart partitioning is applied).
93. YYY.0.1
SUBNETMASK: 255.255.0.0
DEFAULT GATEWAY: none

2. The address settings used for the IP addresses must match the addresses declared in the Management Database (ClusterDB). If these are not known, please contact Bull technical support. The IP addresses given in this section are examples and are for information only.

Note: Bull BAS5 for Xeon clusters do not support VLAN.

Alias Creation on eth0 (Management Node)

Aliases provide hardware-independent IP addresses for cluster management purposes. The alias created below is used by the administration software (see section 3.5).

1. Go to the /etc/sysconfig/network-scripts directory.
2. Copy the ifcfg-eth0 file to the ifcfg-eth0:0 file.
3. Edit the ifcfg-eth0:0 file and modify the DEVICE setting so that it reads eth0:0, as shown:

   DEVICE=eth0:0

4. Modify IPADDR with the alias IP address.

Restarting the network service

Run the command:

   service network restart

3.1.10 External Storage System Installation

The Management Node may be connected to an external storage system when the I/O and Login functions are included in the same Service Node as the Management functions. See Chapter 4, Configuring Storage Management Services, in this manual for more information regarding the installation, and also refer to the documentation provided with the storage system for details on how to install it.
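For illustration, the resulting ifcfg-eth0:0 file could look like the hedged sketch below; the address shown is the alias value used elsewhere in this guide and must be replaced by the alias declared in the ClusterDB:

   DEVICE=eth0:0
   ONBOOT=yes
   BOOTPROTO=static
   IPADDR=172.17.0.99
   NETMASK=255.255.0.0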
94. a file system which uses all the available OSTs and the first available MDT, with no failover. If you want to create more than one file system, and/or with failover capability, refer to the Bull BAS5 for Xeon Administrator's Guide or to the lustre_util man page for more details about the Lustre model files.

Run the following command:

   lustre_util info -f /etc/lustre/models/fs1.lmf

This command prints information about the fs1 file system. It allows you to check that the MDT and OSTs are actually those you want to use. Ensure that no warning occurs.

Lustre HA: carry out the actions indicated in the Configuring File Systems for Failover section, in the Configuring High Availability for Lustre chapter, in the BAS5 for Xeon High Availability Guide.

3. Check what happened.

At this point it is possible to run the following command on a second terminal (checking terminal) to see what happens during the installation process:

   watch lustre_util info -f all

The following message should be displayed:

   No filesystem installed

It is also possible to look at http://<mngt_node>/lustre from a Web browser.

See the lustre_util man page for more information.

4. Install the file system.

Do not perform this step when performing a software migration, as the Lustre configuration details and data will have been preserved. Run the following command:
95. interface ve 1
 ip address 172.17.18.210/16
 end
telnet@myswitch#

9.2 Configuring a Brocade Switch

1. Set the Ethernet IP address for the Brocade switch. Use a portable PC connected to the serial port of the switch.

Notes:
- The IP address and name of the switch to be used may be found in the cluster database (FC_SWITCH table).
- It is mandatory to use the serial cable provided by Brocade for this step. The initial configuration of the Brocade Fibre Channel Switch is made using a serial line (see the Silkworm 200E Hardware Reference Manual).

2. Open a serial session:

   cu -s 9600 -l /dev/ttyS0
   login: admin
   Password: password
   switch:admin>

3. Initialize the IP configuration parameters according to the addressing plan. Check the current IP configuration:

   switch:admin> ipAddrShow
   Ethernet IP Address: aaa.bbb.ccc.ddd
   Ethernet Subnetmask: xxx.yyy.zzz.ttt
   Fibre Channel IP Address: none
   Fibre Channel Subnetmask: none
   Gateway Address: xxx.0.1.1

Set the new IP configuration:

   s3800:admin> ipAddrSet
   Ethernet IP Address [aaa.bbb.ccc.ddd]: <new ip address>
   Ethernet Subnetmask [xxx.yyy.zzz.ttt]: <new subnet mask>
   Fibre Channel IP Address [none]:
   Fibre Channel Subnetmask [none]:
   Gateway Address [none]: <new gateway address>

4. Initialize the switch name, using the name defined in the ClusterDB:

   switch:admin>
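As a hedged follow-up, the new settings can be verified from the same serial session with standard Brocade Fabric OS commands (shown for illustration; the exact output depends on the firmware release):

   switch:admin> ipAddrShow    # confirm the new Ethernet IP address and subnet mask
   switch:admin> switchShow    # display the switch name, state and port status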
97. and the MAC address files have not been found, you can collect the MAC addresses of the admin Ethernet cards for each node as follows:

- Start the DHCPD service by running the command:

   dbmConfig configure --service sysdhcpd

- Configure the nodes so that they boot on the network. Reboot the equipment individually and collect their MAC addresses from the /var/log/messages file.

- Create the file which contains the MAC addresses, IP addresses and cluster elements. Its format is as follows:

   <type> <name> <mac address>

  An example, similar to that below, is available from /usr/lib/clustmngt/clusterdb/install/mac_file.exp:

   node valid0 00:04:23:B1:DF:AA
   node valid1 00:04:23:B1:DE:1C
   node valid2 00:04:23:01:04:54
   node valid3 00:04:23:B1:DF:EC

1. Run the command:

   su - postgres

2. Run the command:

   cd /usr/lib/clustmngt/clusterdb/install

3. Run the following command to collect the domain name of each node of the cluster and to load the MAC addresses of the network cards for the administration network:

   updateMacAdmin <file>

   <file> is the name of the file that was created previously (see above). The full path must be included so that it can be easily retrieved, for example:

   updateMacAdmin /root/cluster-mac-address

4. Go back to root by running the exit command.

3.3.4 Configuring Ethernet Switches
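To speed up the collection of MAC addresses, the hedged one-liner below can be run on the Management Node after the nodes have attempted a network boot; it simply filters the dhcpd messages, and the exact log format may vary slightly between releases:

   grep DHCPDISCOVER /var/log/messages
   # or, to extract just the unique MAC addresses:
   grep DHCPDISCOVER /var/log/messages | awk '{print $(NF-2)}' | sort -u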
98. ation and data files will have been saved.

The automated configuration of Lustre I/O resources uses the storage model file described in the Automated Deployment of the I/O Configuration section in Chapter 5. This model file details how Lustre uses the configured LUNs (description of OST and MDT data and journal LUNs). The Lustre tables in the Cluster database should be populated with the information found in the model file, as described in this section.

1. Declare the Lustre OST configuration:

   stormodelctl -m <model_name> -c generateost

2. Declare the Lustre MDT configuration:

   stormodelctl -m <model_name> -c generatemdt

3. Make the OSTs and MDTs available for the Lustre file system:

   lustre_investigate check

Configuring I/O Resources for Lustre after Manual I/O Configurations

This phase must take place after executing the procedures described in the Manual Configuration of Storage Systems and Manual Configuration of I/O Resources for Nodes sections in Chapter 5. The Lustre tables in the Cluster database must be populated using the /etc/lustre/storage.conf file.

Adding Information to the /etc/lustre/storage.conf file

See the BAS5 for Xeon Administrator's Guide for more details about the storage.conf file.

This phase should be done in the following situations:
- If there is a need to use the Lustre file system and no cluster database is available, as may be the case for
99. b x Bull rpm available either from the BAS5 for Xeon v1 2 XHPC DVD ROM or from Bull Technical Support by running the command below rpm ivh clusterdb data ANY412 20 4 1 b x Bull nodeps 2 Change to postgres su postgres 3 Go to the install directory on the Management Node cd usr lib clustmngt clusterdb install 4 Runthe command preUpgradeClusterdb This command modifies and creates dumps of the Cluster DB data and schema preUpgradeClusterdb 5 If the preUpgradeClusterdb command completes without any errors copy the preclusterdball2041 dmp and preclusterdbdata204 1 dmp files onto an external storage device Contact Bull Technical Support if there are any errors 6 Stop the postgresql service to prevent the cluster database from being modified Cluster Database Operations B 1 service postgresql stop 7 Install BAS5 for Xeon v1 2 on the cluster by following the installation procedure described in Chapter 3 in the BASS for Xeon Installation and Configuration Guide WARNING Read Chapter 3 carefully before installing BAS5 for Xeon Be sure that all data has been backed up onto non formattable media outside of the cluster Check that all the necessary configuration files necessary have been saved and backed up as described in the Pre installation Operations when Re installing BAS 5 v1 2 section 8 Copy across the dump files saved in point 5 from the external storage device to
100. block device> /home_nfs
   mount <release dedicated block device> /release

or, if labels have been applied to the file systems:

   mount LABEL=<label for home_nfs dedicated block device> /home_nfs
   mount LABEL=<label for release dedicated block device> /release

3. Edit the /etc/fstab file and add the following lines so that the settings are permanent (these are physical devices/disks dedicated to NFS usage):

   LABEL=release   /release   auto defaults 0 0
   LABEL=home_nfs  /home_nfs  auto defaults 0 0

4. Use the adduser command with the -d flag to set the /home_nfs directory as the home directory for new user accounts:

   adduser -d /home_nfs/<NFS_user_login> <NFS_user_login>

6.2.2 Setup for NFS v3 file systems

Configuring the NFSv3 Server

1. Edit the /etc/exports file and add the directories that are to be exported:

   /release    <allowed_clients>(ro,sync)
   /home_nfs   <allowed_clients>(rw,sync)

2. Restart the NFS service:

   service nfs restart

3. Configure the NFS service so that it is automatically started whenever the server is restarted:

   chkconfig nfs on

Note: whenever the NFS file system configuration is changed (/etc/exports modified), the exportfs command is used to configure the NFS services with the new configuration:

   exportfs -r
   exportfs -f

Configuring the NFSv3 Client

1. Create the directories that will be used to mount the NFS file systems.
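To make the client side concrete, the hedged sketch below mounts both exports and makes them permanent; the server name login0 is a placeholder for the node that exports the file systems:

   mkdir -p /release /home_nfs
   mount -t nfs login0:/release /release
   mount -t nfs login0:/home_nfs /home_nfs

   # corresponding /etc/fstab entries for permanent mounts
   login0:/release    /release    nfs  defaults  0 0
   login0:/home_nfs   /home_nfs   nfs  defaults  0 0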
101. carried out by running the command ksis postconfig run PostConfig cluster name nodelist For example ksis postconfig run PostConfig xena 1 100 Configuring Interconnect Interfaces Use the config ipoib command to configure the interconnect interfaces for both InfiniBand and Ethernet networks See Zu rA dd Appendix E Configuring Interconnect Interfaces for details on using the config ipoib command Post Deployment Operations Restoring 1 Node aliases Once the BASS for Xeon v1 2 I O Reference Nodes have been deployed the aliases have to be restored on each I O Node According to whether or not a storage model exists for the cluster either a or b below is used to restore the aliases a Where a storage model exists then use the deployment command from the Management Node as shown below stordepmap m model name i lt nodelist gt b If no storage model exists use the stordiskname command to create a new disknaming conf file as shown below Updating BASS for Xeon v1 1 clusters to BASS for Xeon v1 2 2 9 The existing disknaming conf file will be erased when the new I O nodes are deployed The stordiskname command should be used with the r option remote from the Management Node enabling backups and restorations of the etc storageadmin disknaming conf file to be managed automatically If the r option is not used the Administrator will have to manage th
102. ccess control rules for O nodes e Configuration of specific parameters cache size cache policy watermarks etc Phase 2 The configuration of coherent naming for I O node resources e Definition of logical names aliases for LUNs that maintain device names following reboots e Configuration of Quorum disks optional for High Availability The I O configuration can either be automatically deployed with some exceptions configured manually 5 1 Automatic Deployment of the 1 Configuration Nel orant Automatic deployment of the O configuration is not possible for Optima 1250 and EMC CLARiiON AX4 5 storage systems These systems must be configured manually The automatic deployment of the storage configuration uses a model file that describes the data volumes that have to be created and how the nodes can access them See The BASS for Xeon Administrator s Guide for more detailed information about configuration models and the deployment process 2 Storage Model Files A template for the creation of a storage configuration model can be obtained with the following command stormodelctl c showtemplate This template contains declaration examples for storage systems supported from the different storage vendors A model file is specific to storage systems of the same type from a specific vendor The model file contains the following information Configuring I O Resources for the Cluster
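In practice the template is usually captured into a file that is then edited to match the cluster; a hedged example is shown below (the output file name is arbitrary, and the exact option spelling should be checked against the stormodelctl man page):

   stormodelctl -c showtemplate > /etc/storageadmin/mycluster.model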
103. which is acting as the NIS server, and on all the Compute or I/O Nodes (NIS clients).

2. Edit /etc/yp.conf and add a line to set the Login Node as the NIS domain server:

   domain <DOMAIN> server <login_node>

3. Modify the /etc/nsswitch.conf file so that the passwd, shadow and group settings are used by NIS:

   passwd:  files nisplus nis
   shadow:  files nisplus nis
   group:   files nisplus nis

4. Connect to the NIS (YP) server:

   service ypbind start

5. Configure the ypbind service so that it starts automatically whenever the server is restarted:

   chkconfig ypbind on

Note: the NIS status for a Compute or I/O Node can be verified by using the ypcat hosts command. This will return the list of hosts from the /etc/hosts file on the NIS server.

Nodes which use an image deployed by Ksis

The /etc/sysconfig/network file is not included in an image that is deployed from the reference node to the other Compute or I/O Nodes. This means that the NISDOMAIN definition has to be added manually to the files that already exist on the Compute or I/O Nodes, by using the command below:

   pdsh -w cluster[x-y] "echo NISDOMAIN=<DOMAIN> >> /etc/sysconfig/network"

The ypbind service then has to be restarted so that the NIS domain is taken into account:

   pdsh -w cluster[x-y] service ypbind restart

6.2 Configuring NFS v3 to share the /home_nfs and /release directories
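As a hedged check from any client node once ypbind is running (standard NIS utilities, nothing Bull-specific):

   ypwhich              # should print the name of the Login Node acting as NIS server
   ypcat passwd | head  # should list the user accounts served by NIS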
104. package, which is available on the CD-ROM below, delivered with the machines that use these adapters (SUPERMICRO AOC-USASLP-S8iR).

2. Then run the commands below:

   service stor_agent stop
   chkconfig stor_agent off

3. Check that RAID has been configured correctly by running the command:

   lsiocfg -cv | more

4. Look for the host which has aacraid displayed against it. Verify that the detailed information for the Logical and Physical disks displays correctly, as shown in the example below:

   host aacraid 0 256 Optimal SMC AOC-USAS-S8iR LP DR 1 1 5 2437 FW 5 2 0 15575
   Interface SAS/SATA   Slot 7   SN 4FAFF0
   LogicalDisks 2
     Number 1  Name RD1     Device sdd  Status Optimal  Raid 1  Size 239190  DiskLocations 0:3 0:1
     Number 2  Name RD5SAS  Device sde  Status Optimal  Raid 5  Size 280188  DiskLocations 0:4 0:5 0:6
   PhysicalDisks 8
     Device 0  SN WD-WCAPW5S321110  WWN Unknown           State Ready   Location 0:0  Vendor WDC      Size 476940  LogicalDisk         Role
     Device 1  SN WD-WCANYS792200   WWN Unknown           State Online  Location 0:1  Vendor WDC      Size 239372  LogicalDisk RD1     Role
     Device 2  SN WD-WCANYS792290   WWN Unknown           State Ready   Location 0:2  Vendor WDC      Size 239372  LogicalDisk         Role
     Device 3  SN WD-WCANYS600678   WWN Unknown           State Online  Location 0:3  Vendor WDC      Size 239372  LogicalDisk RD1     Role
     Device 4  SN DQO0P65004LP      WWN 500000E01203CF60  State Online  Location 0:4  Vendor FUJITSU  Size 140272  LogicalDisk RD5SAS  Role
     Device 5  SN D00PB5004ND       WWN 500000E01203D2A0  State Online
105. configuring Cluster Suite on High Availability O Nodes See The BASS for Xeon High Availability Guide for details of how to use the stordepha command Lustre nodes and the stordepha nfs command NFS nodes for clusters which have High Availability in place for the I O nodes 2 8 Post Deployment Checks Carry out the post deployment checks that are described in STEP 7 in Chapter 3 in this manual 2 8 1 Optional for SLURM clusters Once SLURM version 1 3 2 has been installed following the system update to BAS5 for Xeon v1 2 then all previously saved state information must be cleared using the c option for example slurmctld c lt job_name gt or use the command etc init d slurm startclean The node state information for SLURM version 1 3 2 will be taken from the new configuration file Updating BASS for Xeon v1 1 clusters to BASS for Xeon v1 2 2 11 2 12 5 for Xeon Installation and Configuration Guide Chapter 3 Installing BAS5 for Xeon v1 2 Software on the HPC Nodes E Read this chapter carefully and install the BASS for Xeon v1 2 software that applies to your cluster This chapter describes the complete installation process for the FIRST installation from scratch of the BAS5 for Xeon v1 2 software environment on all nodes of a Bull HPC cluster The same process can also be used for a reinstallation of BASS for Xeon v1 2 using the existing configuration files see section 3
106. covered Make the following settings Seta FDA name which is the same as the name already defined in the ClusterDB disk array table Enable the SNMP traps and send the traps to the cluster Management Node It is possible to connect to the server via the browser using one of the FDA Ethernet IP addresses if the iSM GUI is not available Use the password C to access the configuration menu See The User s Guide for the FDA storage system for more information 3 Check that end to end access is correctly setup for the cluster Management Node nec admin n fda name i ip address of the Windows FDA management station c getstatus all Configuration of Storage Management 4 7 43 Enabling DataDirect Networks DDN 52 Storage Systems Management 4 3 1 4 3 2 Enabling Access from Management Node Edit the etc storageadmin ddn admin conf file to configure the singlet connection parameters Port number used to connect to RCM API server of ddn port 8008 login used to connect to ddn login admin Password used to connect to ddn password password The configuration file uses the factory defaults connection parameters for the S2A singlets The login and password values may be changed Enabling Date and Time Control If the HPC cluster includes DDN storage systems check and if necessary update the etc cron d ddn set up date time cron file to modify regular time checks Ensure that
107. d can only be run at the time of the first installation or if there is a demand to change the IP address for some reason 4 9 Configuration of Storage Management 4 3 5 2 4 10 ddn init I ddn name This command performs the following operations Set the IP address on the management ports Enable telnet and API services Set prompt Enable syslog service messages directed to the Management Node using a specific UDP port 544 Enable SNMP service traps directed to the Management Node Set date and time Set common user and password on all singlets Activate SES on singlet 1 Restart singlet Set self heal Set network gateway ddn init command tips e init command should not be run on the DDN used by the cluster nodes as this command restarts the DDN e Both singlets must be powered on the serial access configured conman and portserver and the LAN must be connected and operational before using the ddn init command e Randomly the DDN may have an abnormally long response time leading to time outs for the ddn_init command Thus in case of error try to execute the command again e ddn init command is silent and takes time Be sure to wait until it has completed A WARNING The ddn_init command does not change the default tier mapping It does not execute the save command when the configuration is completed Initialization from a Laptop w
108. d for version 1.6:

   clean_extents_on_dirs.sh

4. Set up the new MGS entity on the Management Node and upgrade the Lustre layout by running the upgrade_lustre_layout.sh script from the Management Node:

   upgrade_lustre_layout.sh

5. Update the Lustre file system descriptions. For each Lustre file system run the command:

   lustre_util update -f <fsname>

6. Restart the Lustre file systems:

   lustre_util start -f all
   lustre_util mount -f all -n all

7. Enable Lustre High Availability, if it is in place. For all Lustre file systems run the command:

   lustre_ldap active -f <fsname>

   After running these commands it is strongly recommended to wait for 3 minutes. This corresponds to the default duration of the Lustre HA timeout feature and will ensure that the commands are taken into account correctly.

Appendix D  Manually Installing BAS5 for Xeon Additional Software

If the preparenfs command was NOT used to install the additional software options (XIB and/or XLUSTRE and/or XTOOLKIT), the process to install them manually is described below.

1. Mount NFS from the /release directory on the Management Node to the /release directory on the Service Node:

   ssh <Service_Node> mount -t nfs <Management_Node_IP>:/release /release

2. Install the optional BAS5 for Xeon
109. d to import host details from a group csv file The group csv file is used to supply data to the switch subnet manager This data is used to create the mapping GUID Therefore recognisable hostnames should be used to make switch identification easier In addition it also contains geographical information that may be useful when using Voltaire Fabric Manager Sample from an existing group csv Type Id guid name Don t show in group Rack Id Location in rack U HCA 2c9020024b8f4 zeus14 0 2 0 U Installing and Configuring InfiniBand Interconnects 8 13 8 6 1 Group Configuration menu Enter the group configuration menu as follows ssh enable switchname enable switchname s password voltaire Welcome to Voltaire Switch switchname Connecting switchname config switchname config group switchname config group 8 6 2 Generating a group csv file The group csv file can be generated automatically by using the IBS command as follows user host tmp ibs a group s switchname NE Successfully generated configuration file group csv To update a managed switch with a firmware version 4 X proceed as follows Log onto the switch Enter the enable mod Enter the config menu Enter the group menu Type the following command group import tmp To update a managed switch with a firmware version 3 X proceed as follows Log onto the switch Enter the enable mod Enter the config menu
110. de Figure G 5 Example of Array Properties for a RAID 1 array 5 Press the Escape key to return to the previous screen and select Create Array from the Main Menu All the drives connected to the server will be displayed those that are shown with OKB in the final column see example below will not be accessible as they are already included in an array NP a ii Figure G 6 Example of drive list for a server Configuring AOC USASLP S8iR RAID Adapters for NovaScale R423 and R425 machines G3 6 Press F7 to select the drives to be included in the new array Only drives of the same size can be selected for the new array see figure below MYS 233 668 mem ee Figure G 7 Selection of drives of the same size for new RAID array 7 Press Enter when all the drives have been selected for the new array The Array Properties screen appears see Figures below Select the Array Type to be configured followed by the other properties size label etc for the array LLL E E E S M M NM EM M MI s o e o Figure G 8 Array Properties Array Type G 4 BASS for Xeon Installation and Configuration Guide etie L1 Sr i 1 A I IN T I 1 I IC I I I Figure G 9 Array Properties Writ
111. de is connected to the correct storage system Check the connection of each DDN storage system using the following command ddn conchk I ddn name f Note This command can only be used if ConMan is in place for the DDN storage systems Check that the LUNs are accessible for the storage systems connected to each node by using the command below lsiocfg dv 2 Deploy the aliases for the I O resources from the Management Node As a prerequisite ssh must have been configured password less to allow the Management Node to run remote operations on the nodes connected to storage systems Run the command below using the model file created previously when the storage system was automatically configured stordepmap m model name WARNING This command is silent and long Be sure to wait until the end This operation transmits configuration information to each node attached to the storage system defined in the specified model file A check is made to ascertain which storage resources are accessible from each node compared with the LUNs defined in the model file for it A symbolic link alias is then created for each disk resource that corresponds to a storage system LUN declared in the model file for the node 3 Check aliases created for I O resources Use the following command on each node to check that the aliases have been created correctly BASS for Xeon Installation and Configuration Guide
112. ductor D DDN Data Direct Networks DHCP Dynamic Host Configuration Protocol DIB Device Interface Board DDR Double Data Rate E EIP Encapsulated IP EPIC Explicitly Parallel Instruction set Computing EULA End User License Agreement Microsoft F FCR Fibre Channel Router FDA Fibre Disk Array FSS Fame Scalability Switch FTP File Transfer Protocol G GCC GNU C Compiler GNU GNU s Not Unix GPL General Public License Gratuitous ARP A gratuitous ARP request is an Address Resolution Protocol request packet where the source and destination IP are both set to the IP of the machine issuing the packet and the destination MAC is the broadcast address xx xx xx xx xx xx Glossary and Acronyms G 1 Ordinarily no reply packet will occur Gratuitous ARP reply is a reply to which no request has been made GUI Graphical User Interface GUID Globally Unique Identifier H HDD Hard Disk Drive HPC High Performance Computing HSC Hot Swap Controller IB Infiniband IDE Integrated Device Electronics Input Output Board with 11 PCI Slots IOC Input Output Board Compact with 6 PCI Slots IPD Internal Peripheral Drawer IPMI Intelligent Platform Management Interface IPR IP Router iSM Storage Manager FDA storage systems K KSIS Utility for Image Building and Deployment KVM Keyboard Video Mouse allows the keyboard video monitor and mous
113. - The I/O architecture of the server.

The following paragraphs cover these aspects and provide recommendations for the installation of adapters for the different NovaScale servers. The process to follow is quite easy:

1. Create a list of the adapters to be installed, sorted from the highest bandwidth requirement to the lowest.
2. Place these adapters in each server, using the priority list specific to the platform as defined in this Appendix.

H.2 Creating the list of Adapters

The first step is to make a list of all the adapters that will be installed on the system. Then, if the I/O flow for the server is known (expected bandwidth from the interconnect, bandwidth to the disks, etc.), it will be possible to estimate the bandwidth required from each adapter and then sort the adapters according to the requirements of the operational environment. If there is no information about the real expected I/O flows, the adapters should be sorted according to their theoretical limits.

As both PCI Express adapters and PCI-X adapters may be connected, two tables are provided for the adapters supported by BAS5 for Xeon. These are sorted by throughput, giving the HBA slotting rank:

   Adapter                          Bandwidth    Rank
   Fibre Channel, dual port         800 MB/s     1-2
   Fibre Channel, single port       400 MB/s     2
   Gigabit Ethernet, dual port      250 MB/s     1-2
   Gigabit Ethernet, single port    125 MB/s     2
   Ethernet 100 Mbps                12.5 MB/s
114. e directories 6 2 1 Preparing the LOGIN node NFS server for the NFSv3 file system Firstly create a dedicated directory mount point for the NFS file system which is dedicated to home usage As the home directory is reserved for local accounts it is recommended that home_nfs is used as the dedicated home directory for the NFS file system Recommendations Use dedicated devices for NFS file systems one device for each file system that is exported The lsiocfg d command will provide information about the devices which are available Use the LABEL identifier for the devices Use disks that are partitioned E If a file system is created a disk which is not partitioned then mount cannot be used with the LABEL identifier The disk device name e g dev sdX will have to be specified in the etc fstab file Notes The following instructions only apply if dedicated disks or storage arrays are being used for the NFS file system The following examples refer to configurations that include both home_nfs and release directories If the release NFS file system has already been exported from the Management Node ignore the operations which relate to the release directory in the list of operations below 1 Create the directories that will be used to mount the physical devices mkdir home_nfs mkdir release 2 Mount the physical devices mount lt home_nfs dedicated
115. e software is going to be installed as shown below Before you click yes to confirm this check that the BMC for the node is reachable If this is not the case answer no and manually reboot your node later Do you want prepareNFS to perform a hard reboot via the usr sbin nsctrl command on the node s listed for the installation y n 3 4 3 Launching the NFS Installation of the BAS5v1 2 for Xeon software 1 The Bull BAS5v1 2 for Xeon software will be installed immediately after the reboot The progress of the install can be followed using conman via a serial line and or by using vncviewer if you have chosen to use VNC 2 Once the Linux distribution has been installed the kickstart will then manage the installation of the optional HPC product s selected for the installation and the node will then reboot The node can then be accessed to carry out any postinstallation actions that are required using the ssh command the root password is set to root by default 3 The preparenfs script will generate a log file root preparenfs log on the Management Node that can be checked in case of any problems See Appendix D Manually Installing Bull BAS5v1 2 for Xeon Additional Software in this manual if there is a need to install any of the additional software options XIB XLUSTRE and XTOOLKIT later after completing this step 3 48 55 for Xeon Installation and Configuration Guide 3 5 STEP 5 Configuring Administration So
117. e PCI Express slot and plug the Host Channel Adapter into the slot handling the HCA carefully by the bracket 3 Press the HCA firmly into the PCI Express slot by applying pressure on the top edge of the bracket 4 Reinstall any fasteners required to hold the HCA in place Connect the InfiniBand cable to either of the HCA ports and to the switch 6 Reconnect the host to its power source and power up the system Installing and Configuring InfiniBand Interconnects 8 1 8 2 Configuring the Voltaire ISR 9024 Grid Switch 8 2 1 Connecting to a Console Connect the Management Node with a terminal emulation program to the RS 232 console interface according to the instructions in the Hardware Installation Guide Make sure that the terminal emulation program is configured as follows Setting Value Terminal Mode VT 100 Baud 38400 Parity No Parity Stop Bits 1 Stop Bit Flow Control None Table 8 1 Voltaire ISR 9024 Switch Terminal Emulation Configuration 8 2 2 Starting a CLI Management Session using a serial line To start a Command Line Interface management session for the switch via a HyperTerminal connection do the following 1 Connect the switch via its serial port using the cable supplied by Voltaire 2 Start the HyperTerminal client 3 Configure the terminal emulation parameters as described in the section above 4 Type in the appropriate password at the logon prompt The Admin default password
118. e backup of the etc storageadmin disknaming conf file manually When used remotely r option immediately after I O node deployment the stordiskname command must be used in update mode u option This ensures that the LUNs are addressed by the same symbolic link names as used previously and avoids having to configure the file system again i The stordiskname command should be executed from the Management Node as shown below If the node is NOT in a High Availability pair stordiskname u r node name If the node is in a High Availability pair stordisknam u r nodel name node2 name Note For some storage systems not including FDA and DDN the stordiskname command may return an error similar to the one below Error This tool does not manage configuration where a given UID appears more than once on the node If this happens try running it with the m SCSI ID option ii The symbolic links aliases must be recreated on each node using the information contained within the disknaming conf file newly created by stordiskname To do this run the stormap command as below If the node is NOT in a High Availability pair ssh root lt node_name gt stormap c If the node is in a High Availability pair ssh root lt nodel_name gt stormap ssh root lt node2_name gt stormap 2 10 5 for Xeon Installation and Configuration Guide 2 7 8 Re
119. e caching Note Itis recommended that Write Caching is disabled however this is not obligatory 8 Confirm all the values for the new RAID array by selecting Done as shown in the Figure below The settings below are an example only Figure G 10 Array Properties Confirmation screen Configuring AOC USASLP S8iR RAID Adapters for NovaScale R423 and R425 machines G 5 9 Exit the Array Configuration Utility Press Escape several times until the Options screen appears and select SerialSelect Utility as shown below SerialSelect Utility IFES 4 i BB BET Figure G 11 RAID Configuration Utility Options Menu 10 Select Controller Configuration as shown below Figure G 12 RAID Configuration Utility Options Menu gt Controller Configuration G 6 BASS for Xeon Installation and Configuration Guide 11 Check all the settings for the Controller see Figure below Drives Write eee eee d Runtime BIQG 3 i isvwccaccescccteccttecuscveceste AutonaticsFallovert soc cacceceaccsceeccesceeeconc firray Background Consistency Check 22 Array based BES HEL Disabled SATA Native Command Queuing ssssssss Enabled Physical Drives Display during POST Disabled DVD CD ROH Boot Support cccccccscccccseeesess Disabled Removable Hedia Devices Boot Support Disabled Ala
120. e esee R Ter et Ue us e RR euo G7 RAID Configuration Utility Options Menu gt Disk Utilities eee G 8 An example of a drive list for an Adaptec controller sssssssssssses G 8 RAID Configuration Utility Exit Utility menu cssc oet eov vem ed ob ree eret G 9 NovaScale R421 rear view of Riser architecture H 3 NovaScale R421 rear view Connectors eer te rn divers e e dte eie aay H 4 NovaScale R422 rear view of Riser architecture H 5 NovaScale R422 Rear view connectors eene H 5 NovaScale R460 risers and I O subsystem slotting c cccsccceeesceeeeeeeeteseeeeseeeesnseeeees H 7 Rear view of NovaScale R460 Series eese eene enne nns H 7 xii BASS for Xeon Installation and Configuration Guide List of Tables Table 8 1 Table H 1 Table H 2 Table H 3 Table H 4 Table H 5 Voltaire ISR 9024 Switch Terminal Emulation Configuration 8 2 PCIX Adapter Tet lec ITE D OO ee ee ede H 2 CHE H 2 NovaScale R421 Slats and Commectars ccccccccccdscscacocesddsorasssencvncadsacncavsncdardcesdieseredeuoentens H 4 NovaScale R422 Slots and Commectors e ccicsscssscdcacosevesseresstcsconendesvssavendevstaceseiesstenansncoes H 6 NovaScale R460 Slots and Connectors cccccscisssscacosevessersassencnsassavusavtecdvenncnvdecssevasioontons H 8 Table of Contents xiii xiv BASS for Xeon Installation and Configuration
121. e managed automatically This is highly recommended If the r option is not used the Administrator will have to manage the backup of the etc storageadmin disknaming conf file himself When used remotely r option immediately after a ksis image re deployment or a node system restoration the stordiskname command must be used in update mode u option This ensures that the LUNs are addressed by the same symbolic link names as used previously and avoids having to configure the file system again Configuring I O Resources for the Cluster 5 7 The stordiskname command should be executed from the Management Node as shown below possibly with the m SCSI ID option see Note above If the node is NOT in a High Availability pair stordiskname u r node name If the node is in a High Availability poir stordisknam u r nodel name node2 name The symbolic links aliases must be recreated on each node using the information contained within the disknaming conf file newly created by stordiskname To do this run the stormap command as described previously ssh root8 node name stormap c 5 8 BASS for Xeon Installation and Configuration Guide Chapter 6 Configuring File Systems Three types of file structure are possible for sharing data and user accounts for BASS for Xeon clusters e NIS Network Information Service can be used so that user accounts on Login Nodes are available on the Compute
122. e to be connected to the node L LAN Local Area Network LDAP Lightweight Directory Access Protocol LUN Logical Unit Number M MAC Media Access Control a unique identifier address attached to most forms of networking equipment MDS MetaData Server MDT MetaData Target MKL Maths Kernel Library MPI Message Passing Interface N NFS Network File System NPTL Native POSIX Thread Library G 2 BASS for Xeon Installation and Configuration Guide NS NovaScale NTFS New Technology File System Microsoft NTP Network Time Protocol NUMA Non Uniform Memory Access NVRAM Non Volatile Random Access Memory O OEM Original Equipment Manufacturer OPK OEM Preinstall Kit Microsoft OST Object Storage Target P PAM Platform Administration and Maintenance Software PAPI Performance Application Programming Interface PCI Peripheral Component Interconnect Intel PDU Power Distribution Unit PMB Platform Management Board PMU Performance Monitoring Unit PVFS Parallel Virtual File System Q R RAID Redundant Array of Independent Disks ROM Read Only Memory RSA Rivest Shamir and Adleman the developers of the RSA public key cryptosystem 5 SAFTE SCSI Accessible Fault Tolerant Enclosures SCSI Small Computer System Interface SDP Socket Direct Protocol SDPOIB Sockets Direct Protocol over Infiniband SDR Sensor Data Record
123. ear Generate configuration files tmp CfgSwitches eswu 0cl confg tmp CfgSwitches eswulc0 confg tmp CfgSwitches eswulcl confg Temporary configuration files will start with 192 168 101 1 ip address 255 255 255 0 netmask 2 Preinstallation of switches At this stage the following actions are carried out e Temporary configuration of the ethO network interface aliases and reconfiguration of the DHCPD service on the Service Node e configuration files are copied to the tftpboot directory e DHCP service is reconfigured and restarted These actions are carried out by running the command swtAdmin preinstall dbname database name gt network lt admin backbone gt logfile logfile name gt verbose help While this command is being carried out the following message will appear Pre installation of switches copy configuration files in tftpboot directory Configuring Switches and Cards 9 3 9 4 WARNING we are looking for uninstalled switches Please wait Pre installed X new switches Note After this step has finished the switches will use the temporary configuration 3 Discovering new switches on the network If the cluster includes more than one switch the netdisco application runs automatically in order to discover the network topology This is carried out by running the command swtAdmin netdisco first device name to start netdisco gt ne
124. econdary DNS X Cancel Release Notes Back amp p Next Figure 3 12 Network Configuration Screen Installing BASS for Xeon v1 2 Software on the HPC Nodes 3 13 6 The next step is used to configure network access for the Management Node Click on manually and enter the hostname this is shown as xenaO in the example above Select the device connected to the cluster management network normally this is ethO and click on the Edit button Enter the IP address and NetMask configuration settings see Figure 3 11 The miscellaneous settings for the Gateway Primary DNS and Secondary DNS can be configured if necessary Warning messages may appear if this is not done and can be ignored Click on the OK and Next buttons in Figure 3 11 when all the network configurations have been set Note host name in the screen grab must be replaced by the name of the Management Node The IP addresses in the screen above are examples and will vary according to the cluster 3 1 6 Time Zone Selection and Root Password RED HAT ENTERPRISE LINUX 5 Please click into the map to choose a region Europe Paris lt System clock uses UTC 3 Release Notes 4 Back amp Next E Figure 3 13 Time Zone selection screen 7 Select the Time Zone settings required as shown in Figure 3 12 and click on Next Note Bull recommends using UTC check the System clock uses UTC box
125. ectory unalias cp cp a media cdrecorder release XBAS5V1 2 Note Ifthe unalias cp command has already been executed the message that appears below can be ignored bash unalias cp not found 3 Eject the XLustre DVD ROM 3 2 4 Installing the Bull 5 for Xeon software N The mandatory RHEL packages and general BAS5 for Xeon products will be installed automatically by default 3 22 55 for Xeon Installation and Configuration Guide Go to the release XBAS5V1 2 directory cd release XBAS5V1 2 The software installation commands for the Management Node correspond to the Function Product combination applicable to the Service Node which includes the Management Node See Chapter 1 for a description of the different architectures and functions possible The BASS for Xeon install command syntax is shown below install func MNGT IO LOGIN prod XIB XLUSTRE XTOOLKIT The func option is used to specify the node function s to be installed and can be a combination of the following MNGT for management functions O for IO NFS functions LOGIN for login functions Different combinations of products can be installed using the prod flag The prod options include the following XIB to install the BAS5 for Xeon InfiniBand software This needs to be purchased separately XLUSTRE to install the BAS5 for Xeon Lustre software This
126. ed interactively for the partitioning auto kickstart will use a predefined partitioning The auto option will only handle the sda disk and will leave other node disks as previously partitioned Use the manual partitioning option if other previously partitioned disks need to be repartitioned The auto kickstart options are shown below usr opt tmp var swap ext3 ext3 ext3 ext3 ext3 16 GBs 10 GBs 10 GBs 10 GBs 10 GBs The remaining disk space sda sda sda sda sda sda 3 The question Do you want to enable vnc mode will appear If you answer no it will be possible to follow the installation via a serial line conman 4 The path that includes the BAS5v1 2 for Xeon software installer This will be something like release XBAS5V1 2 A list of potential paths will be displayed as shown below Select the path for the Bull HPC installer 0 Choose Custom PATH 1 NONE 2 release XBAS5V1 2 Enter the number for the path 5 The HPC node functions that you want to install The possible options are IO LOGIN COMPUTE COMPUTEX See Chapter 1 for more details regarding the different BASS for Xeon architectures Some of these functions may be installed together as shown for the group C functions below Select the node functions to be installed Node functions from the same group can be added together for example IO and LOGIN Node functions from different groups are exclusive 1
127. edia cdrecorder 3 Insert the RHEL5 1 DVD into the DVD reader and movunt it mount dev cdrom media cdrecorder 4 Copy the RHEL5 1 files to the release RHEL5 1 directory cp a media cdrecorder media cdrecorder discinfo release RHEL5 1 Note This step will take approximately 7 minutes 5 Eject the DVD umount dev cdrom 3 20 55 for Xeon Installation and Configuration Guide or use the eject command eject 6 Ifthe RHEL5 1 Supplementary for EM64T CDROM is part of your delivery carry out steps 7 to 11 below The Java virtual machine rpm on RHELS 1 Supplementary for EM64T CDROM has be installed later on clusters that use the hpcviewer tool included in HPC Toolkit 7 Create the directory mkdir p release RHEL5 1 Supplementary 8 Insert the RHEL5 1 Supplementary for EM 4T CDROM into the CD reader and mount it mount dev cdrom media cdrecorder 9 Copy the RHEL5 1 supplementary files into the release RHEL5 1 Supplementary directory cp a media cdrecorder release RHEL5 1 Supplementary 10 Eject the DVD umount dev cdrom or use the eject command eject 3 2 2 Preparing the Installation of the 5 for Xeon XHPC software 1 Create the directory for the BASS for Xeon v1 2 XHPC software mkdir p release XBAS5V1 2 2 Insert the BASS for Xeon v1 2 XHPC DVD ROM into the DVD reader and mount it
128. een compiled with Intel compilers The Bull XHPC CD ROM installs the intelruntime lt version gt Bull X x86_64 rpm which contains various free distribution Intel libraries that are needed for these applications to work on all node types Management I O Login COMPUTEX and COMPUTE These libraries are installed in the opt intelruntime lt version gt folder where version equals the compiler version number for these libraries For example for applications which have been compiled with version 10 1 011 compilers the folder is named 10 1 011 The opt intelruntime lt version gt path should be added to the LD_LIBRARY_PATH environment variable in the shell configuration file so that the applications delivered on the Bull XHPC CDROM can run If there is a desire to install a different version of an Intel compiler then this has to be copied on to the other nodes in order to ensure coherency At the same time the path in the LD_LIBRARY_PATH variable has to be modified to include the new version reference Intel Compilers Install the Intel Compilers as and when required This is not necessary if they have been systematically deployed previously see STEP 5 in Chapter 3 The compilers must be installed on the node which contains the Login functionality this may be a dedicated node or one which is combined with the I O and or Management functionalities Follow the instructions written in the Bull notice supplied with the compiler Fortran Co
129. ence Nodes and the SLURM configured there as well p lt password gt Optional If there is a need to create a logon for the slurmuser user name a password can be specified that will be applied for slurmuser on all the nodes of the cluster b lt base_pathname gt Optional If SLURM is installed in a directory other than the usr default the path to the install directory should be specified here e g opt slurm This also affects the location of the SLURM configuration file if b is not specified the SLURM configuration file will be accessed using the default etc slurm slurm conf path If b is specified the configuration file will be accessed at lt base_pathname gt etc slurm conf v Optional verbose option If set additional progress messages are outputted when the script is executed d Optional debug option If set parameters and variable names are outputted when the script is executed to help debugging F Optional force option If slurmuser or slurmgroup already exist on any of the nodes this option may be used to force the deletion and recreation of the user name and group name T Internal script option Set for subordinate scripts in order to inhibit actions that are only performed by the main script e g generation of credential keys Manually configuring SLURM on the Reference Nodes If there is a problem with the SLURM setup script then SLURM can be configured manually on the Reference Nodes The following s
130. entered in by the user are in Courier New for example COMI Commands files directories and other items whose names are predefined by the system are in Bold as shown below The etc sysconfig dump file e use of Italics identifies publications chapters sections figures and tables that are referenced e lt gt identifies parameters to be supplied by the user for example node name WARNING A Warning notice indicates an action that could cause damage to a program device system or data Preface iii iv BASS for Xeon Installation and Configuration Guide Table of Contents Chapter 1 Cluster Configuration 1 1 1 1 Lun me HN PT 1 1 2 Hardware Configuration 2c vacntaueayancswaenscesadiosdiuass aad nieadaa 1 1 1 2 1 BASS for Xeon Cluster are Fr 1 1 1 2 2 Different architectures possible for BASS for 1 2 1 2 3__ Service 0 mE 1 4 124 ades ee eee paar 1 7 NU NES neon 1 10 1 2 6 High Speed Interconnection o pieta RUP tertie Rite teque 1 10 Ier ee 1 12 1 3 Software Environment e 1 14 1 3 1 Main Console and Hardware Management 1 14
131. eon Maintenance Guide for more information on the IBS tool ibs a topo s subnet manager IP address or hostname DESCRIPTION HOSTNAME NODEGUID NODELID LOCATION ISR9024D Voltaire iswu0c0 2 0 0008 10400411946 0x0017 A 2 RACK1 B In this case the node GUID is Ox0008f1040041 1946 8 3 2 2 Determining the node GUIDs for a chassis switch Find the Spine lines and look out for the NODEGUID field Use the IBS tool as below to identify the node GUID ibs a topo s subnet manager IP address or hostname PART ASIC NODESYSTEMGUID NODEGUID NODELID CHASSIS 5 1 4 3 Ox0008 10400401e60 0x0008f 10400401elb 0x0001 iswuOcO In this case the node guid is Ox0008f10400401e1 b Repeat for all spines on all switches Alternatively the IBS tool version 0 2 8 can be used to produce the same information as follows user host ibs a showspines s subnet manager IP address or hostname Available spines 0x0008 10400401elb 0 0008 10400401 1 8 3 3 Adding new Spines Each spine is specified using an index nodeguid tuple as follows Note that the index can be any positive integer and its value does not impact performance switchname config sm spines add 1 0x0008f10400411946 The change will take effect after the next fabric reconfiguration Note If the switch firmware is Voltaire version
132. er number of new switches netaddress network ip for temporary config netmask netmask for temporary configuration network admin backbone first device name to start netdisco gt dbname database name gt logfile lt logfile name gt verbose help usr sbin swtAdmin auto switch number 4 network backbone Actions generate preinstall netdisco mac update install save auto step by step clear Options help dbname verbose logfile switch number first netaddress netmask network Generate configuration files Copy configuration files in the tfpboot and restart DHCPD for the pre installation of the switches Run netdisco in order to discover new switches Update database with the MAC address of the new switches Install new switches Save the configuration of the new switches Full configure and installation of switches Interactive configuration and installation of switches Delete temporary configuration files Display this message Specifies the name of the database default value ClusterDB Debug mode Specifies the logfile name default var log switchcfg log Number of switches to install default 1 Specifies the IP address or name of device to start netdisco Specifies the network IP to use for the pre install configuration Specifies the netmask to use for the pre install configuration Specifies the type of network to be installed admin o
133. eries storage devices 4 15 4 5 3 Complementary Configuration Tasks for EMC Clariion AX4 5 storage devices 4 16 4 5 4 Configuring the EMC Clariion DGC Access Information from the Management Node 4 16 4 6 Updating the ClusterDB with Storage Systems Information 4 17 4 7 Storage Management SerVIGes oo uoi Ene Rs err ra RON PE CH EORR LER aH oid aos i i tft este 4 17 Table of Contents vii 4 8 Enabling Brocade Fibre Channel Switches Da e OP RP hearin 4 18 4 8 1 Enabling Access from Management Node sssssssssneeeees 4 18 4 58 2 lt Updating he Clusiel DB sc oo tee cmo C Heide mer mi Ov 4 18 Chapter 5 Configuring I O Resources for the Cluster 5 1 5 1 Automatic Deployment of the I O Configuration ssssssssssssseee 5 1 5454 Storage Model Fes usos e e teases ahd at aa Cie et eben oie Rat 5 1 5 1 2 Automatic Configuration of a Storage System csse remedio eet apart 5 2 5 1 3 Automatic Deployment of the configuration of I O resources for the nodes 5 4 5 2 Manual Configuration of I O Resaurees cse rhe ox ve Coh eld petet tla tst 5 5 2 1 Manual Configuration of Storage Systems 5 5 5 2 2 Manual Configuration of I O resources for 5 5 Chapter 6 Configuring File Systems ei det Probo ra derat betae En 6 1 6 Setti
etting with an external time source, such as a radio or satellite receiver. It covers only time synchronization between the Management Node and other cluster nodes, the Management Node being the reference time source.

Note: It is recommended that the System Administrator synchronizes the Management Node with an external time source.

Modify the /etc/ntp.conf file on the Management Node as follows.

The first two lines must be marked as comments:
    #restrict default kod nomodify notrap nopeer noquery
    #restrict -6 default kod nomodify notrap nopeer noquery

Leave the lines:
    restrict 127.0.0.1
    restrict -6 ::1

The next line should have the following syntax, assuming that the parameters used are for the management network with an associated netmask:
    restrict <mgt_network_IP_address> mask <mgt_network_mask> nomodify notrap

For example, if the IP address of the Management Node alias is 172.17.0.99:
    restrict 172.17.0.0 mask 255.255.0.0 nomodify notrap

Put the following lines in as comments:
    #server 0.rhel.pool.ntp.org
    #server 1.rhel.pool.ntp.org
    #server 2.rhel.pool.ntp.org

Leave the other command lines and parameters unmodified.

Restart the ntpd service:
    service ntpd restart

Start ntptrace with the IP address of the Management Node alias (x.x.0.99). Example:
    ntptrace 172.17.0.99
    ns0: stratum 11, offset 0.000000, synch distance 0.012515

3.3.10 Conf
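To summarize the NTP settings described above, the modified /etc/ntp.conf on the Management Node contains lines along the following pattern; the 172.17.0.0 / 255.255.0.0 values are the example management network used above and must be replaced by the real ones.

    # default restrictions commented out
    #restrict default kod nomodify notrap nopeer noquery
    #restrict -6 default kod nomodify notrap nopeer noquery
    restrict 127.0.0.1
    restrict -6 ::1
    # allow the management network to query this server
    restrict 172.17.0.0 mask 255.255.0.0 nomodify notrap
    # external pool servers commented out
    #server 0.rhel.pool.ntp.org
    #server 1.rhel.pool.ntp.org
    #server 2.rhel.pool.ntp.org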
evices declared in the model are initialized correctly and are accessible via their management interface.

WARNING
When a storage model is deployed, any existing configuration details that are in place are overwritten. All previous data will be lost.

Initial conditions
For some storage systems (EMC CLARiiON), the LUNs can only be accessed using authorized Fibre Channel adapters (HBAs) for the hosts connected to the storage system. This access control is based on the Worldwide Names (WWN) of the FC adapters. These WWN details must therefore be collected and stored in the Cluster Database, using the following command:
    ioregister -a

The collection of I/O information may fail for those nodes which are not yet operational in the cluster. Check that it succeeded for the nodes referenced by the Mapping directives in the model file, i.e. for the nodes that are supposed to be connected to the storage system.

Configuration process
1. Create or reuse a storage configuration model and copy it into the /etc/storageadmin directory on the Management Node:
    cd /etc/storageadmin
2. Apply the model to the storage systems:
    stormodelctl -m <model_name> -c applymodel

WARNING
This command is silent and long. Be certain to wait until the end.

To have better control when applying the model on a single system, it is possible to use the verbose option, as below:
ey has been generated, generate a key with the ssh-keygen command:
    ssh-keygen -b 1024 -t rsa
The default directory should be accepted. This command will request a passphrase to retrieve the password. Do not use this function: press the return key twice to ignore the request.

4. The public key for the FDA manager Linux system should be copied with sshd:
    scp ~/.ssh/id_rsa.pub <administrator>@<LinuxFDAhost>:
<LinuxFDAhost> can be a host name or an IP address. Replace <administrator> with the existing administrator login details.

5. Connect to the Linux system FDA manager:
    ssh <administrator>@<LinuxFDAhost>

6. Do not destroy the existing ~/.ssh/authorized_keys file. Run:
    mkdir -p ~/.ssh
    cat id_rsa.pub >> ~/.ssh/authorized_keys
    rm id_rsa.pub

Note: If necessary, repeat this operation for other pairs of Linux and FDA manager users.

Enabling password-less ssh execution for the Apache server for the Management Node
ssh may also be activated from the Linux Apache account. For this specific user, sudo must be configured. Check that the appropriate rights have been set for the nec_admin command:
    grep nec_admin /etc/sudoers
This command should return the following line:
    %apache ALL=(root) NOPASSWD: /usr/sbin/nec_admin
If this does not happen, run visudo to modify the sudoers file and add the line above.

4.2.2 C
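As a compact recap of the key exchange above, the sequence can be run end to end as follows; the fdaadmin login and fdahost name are placeholders only and must be replaced by the real administrator account and FDA manager host.

    ssh-keygen -b 1024 -t rsa                 # accept the default directory, empty passphrase
    scp ~/.ssh/id_rsa.pub fdaadmin@fdahost:
    ssh fdaadmin@fdahost
    mkdir -p ~/.ssh
    cat id_rsa.pub >> ~/.ssh/authorized_keys  # append, do not overwrite the existing file
    rm id_rsa.pub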
137. figured Configuring AOC USASLP S8iR RAID Adapters for NovaScale R423 and R425 machines G9 G 10 BAS5 for Xeon Installation and Configuration Guide Appendix H PCI Slot Selection and Server Connectors This appendix provides detailed information regarding the choice of PCI slots for high bandwidth PCI adapters The configuration rules put forward ensure the best performance levels without I O conflicts for most type of applications System diagrams are included which may be used to configure the hardware connections The following topics are described e 1 How to Optimize I O Performance e 2 Creating the list of Adapters e H 3 Connections for NovaScale R4xx Servers H 1 How to Optimize I O Performance The I O performance of a system may be limited by the software and also by the hardware The I O architecture of servers can lead to data flows from PCI slots being concentrated on a limited number of internal components leading to bandwidth bottlenecks Thus it is essential to look at the installation of PCI adapters and slot selection carefully to reduce any limitations as much as is possible One good practice is to avoid connecting bandwidth hungry adapters to the same PCI bus The following details should be ascertained in order to ensure the highest possible performance for the adapter installation e Adapter characteristics maximum theoretical performance and expected performance in the operational context
file.
• The addresses predefined in the ClusterDB for the management ports. These may be retrieved using the storstat command.

4.2.1 Installing and Configuring FDA software on a Linux system
On Linux, the disk_array table in the ClusterDB contains the mgmt_node field, which is the foreign key for the node table. This table contains information such as the IP address for the FDA storage manager. The Storage Manager server and the CLI software may be installed on a Linux system planned for FDA management.

Note: The Storage Manager GUI client can only be installed on Windows.

1. Install the RPMs:
    rpm -iv ISMSMC.RPM ISMSVR.RPM
The ISMSMC RPM is located on the FDA series StoreWay Manager Integration Base CD-ROM. The ISMSVR RPM is located on the FDA series StoreWay ISM Storage Manager CD-ROM.

2. FDA Manager Configuration
a. Copy the /etc/iSMsvr/iSMsvr.sample file into the /etc/iSMsvr/iSMsvr.conf file. Add the lines that define the disk arrays to be managed, using the syntax shown in the example below:
    fda1500 - Two IP addresses are defined
    diskarray1 ip 172.17.0.200 172.17.0.201
    fda2500 - Two IP addresses are defined
    diskarray2 ip 172.17.0.210 172.17.0.211
b. Add the following line in the client section, after the default line for login, in the iSMsvr.conf file. Note that the admin user and the admin password details must be consistent with
139. firmware 8 15 8 7 Verifying Voliairetonhquralion ceo e ete eve at een ie titres 8 16 8 8 Voltaire GridVision Fabric Manager eene 8 16 8 9 Information on Voltaire Devices ener 8 16 Chapter 9 Configuring Switches and Cards essssse 9 9 Configummg lhemebipwitehes cox mascota e e t af E a edi 9 9 1 1 Ethernet Installation scripts s ose EO n ONERE EROR RUE RSS De SRM dm M SD ERE 9 1 9 1 2 swtAdmin Command Option Details oc viuaraascagacaye 9 2 9 1 3 Automatic Installation andConfiguration of the Ethernet 9 2 9 1 4 Ethernet Switch Configuration Procedure cinere t deeds 9 3 91 5 Ethernet Switches Configuration File sss ossa du retener eo ets RE dR 9 6 9 1 6 Ethernet Switches Initial Configuration ssssssssssseeeeeeeee 9 6 9 1 7 Basic Manual Configuration eene 9 8 9 2 Configuring a Brocade SWIIChs suit c a Bao cob a undue 9 15 9 3 Configuring Voltaire Devices zs ce a det iin ic oit fub Elo RR 9 16 9 4 Installing Additional Ethernet Boards diste edis gb vss tes 9 17 Appendix A Default Logins for different cluster 1 Appendix B Cluster Database Operations ss B 1 B 1 Migrating to BASS for Xeon seb 2s acutae eU dete MENTO EH civ eb tula suc
fix 0xfe80000000000000
    port state change trap    enable
    bad ports mode            disable
    pm mode                   enable
    grouping mode             enable

2. To change the topology setting to 3 stage CLOS, run the command below:
    <switchname>(config-sm)# sm info topology set 3
or, to change the topology setting to 5 stage CLOS:
    <switchname>(config-sm)# sm info topology set 5
The changes will take effect after the next Fabric Reconfiguration.

3. For both CLOS 3 and CLOS 5 topologies, some of the switches (or switch ASICs) will need to be declared as spines, as shown in the sections which follow.

8.3.2 Determining the node GUIDs
Before starting, the Administrator should know which Voltaire switches are the top switches. Contact Bull if this information is not available. All the top switches must be defined as spines. Each top switch is identified using its node GUID. There are 2 possible cases:
• The top switch is an ISR9024 DM
• The top switch is not an ISR9024 DM, i.e. the switch is a chassis switch (ISR9096 DM, ISR9288 DM, ISR2012, etc.)

8.3.2.1 Determining the node GUIDs for a Voltaire ISR9024 DM switch
Look for the NODEGUID fields of all top switches.
1. For Voltaire ISR 9024 switches, make a note of the NODEGUID identifier which is shown when the ibs topo action command is run, as shown in the example below. See Chapter 2 in the BAS5 for X
for example 3 Compute Nodes are listed as ns[2-4], then enter the following command for the deployment:
    ksis deploy image1 ns[2-4]

Note: The reference nodes may be kept as reference nodes and not included in the deployment. Alternatively, the image may be deployed on to them so that they are included in the cluster. It is recommended that this second option is chosen.

Post-Deployment Node configuration

Edit the postconfig script - Clusters with Ethernet interconnects only
Before running the postconfig command below, the script will need editing as follows for Ethernet clusters:
1. Run the command below to disable the configuration of the interconnect interfaces:
    ksis postconfig disable CONF_60_IPOIB
2. Recompile the postconfig script by running the command below:
    ksis postconfig buildconf

Important: Do not edit the postconfig script when installing on InfiniBand clusters.

3.6.4.2 postconfig command
Once the image deployment has finished, the cluster nodes will need to be configured according to their type (Compute, I/O, etc.). Post-deployment configuration is mandatory, as it configures Ganglia, Syslog-ng, NTP, SNMP and Pdsh automatically on these machines. It also allows the IP over InfiniBand interfaces to be configured according to the information in the Cluster database. The Ksis postconfig command configures each n
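Putting the deployment commands above together, a session for a small Ethernet cluster could look like the sketch below; the image name image1 and node list ns[2-4] are the examples used above and must be replaced by the real values.

    ksis nodelist                           # check that all nodes are up
    ksis postconfig disable CONF_60_IPOIB   # Ethernet interconnects only
    ksis postconfig buildconf               # Ethernet interconnects only
    ksis deploy image1 ns[2-4]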
142. formance and cluster security reasons it is advised to connect the backbone to the Login and Management Nodes only High Speed Interconnection InfiniBand Networks The following devices may be used for InfiniBand clusters Voltaire Switching Devices For InfiniBand Networks the following Voltaire devices may be used e 400 Ex D Double Date Rate DDR Host Channel Adapters which can provide a bandwidth up to 20 Gbs per second host device PCI Express e 158 9024D switch with 24 DDR ports e Clusters with up to 288 ports will use Voltaire ISR 9096 or 9288 or 2012 Grid Directors to scale up machines which include 400 ExD HCAs and ISR 9024 switches BASS for Xeon Installation and Configuration Guide 1 2 6 2 e Clusters of more than 288 ports will be scaled up using a hierarchical switch structure based on the switches described above See For more information on installing and configuring Voltaire devices refer to the Chapter on Installing and Configuring InfiniBand Interconnects in this manual and to the Bull Voltaire Switches Documentation CD Mellanox ConnectX Dual Port Cards Mellanox ConnectX InfiniBand cards support Dual 4x ports providing a bandwidth of 10 or 20 or 40 Gb s per port They support PCI Express 2 0 but are compatible with PCI Express 1 1 and fit x8 or x16 slots part number DCCH406 DPOO should be used with NovaScale R421 R422 R421 and R422 Compute Node
ftware on LOGIN, I/O, COMPUTE and COMPUTEX Reference Nodes
This step describes how to install and configure SSH, kdump, SLURM and PBS Pro, as necessary, for the Reference Nodes to be deployed. It also describes the installation of compilers on the Login Nodes and the configuration of the MPI User environment.

3.5.1 Configuring SSH and /etc/hosts

Important: These tasks must be performed before deployment.

3.5.1.1 For a reinstallation of BAS5 for Xeon v1.2
Retrieve the SSH keys of the nodes and of the root user which have been saved previously (see section 3.0.2). To do this:
• Restore the /etc/ssh directory of each type of node to its initial destination.
• Restore the /root/.ssh directory on the Management Node.
• Go to the root directory:
    cd /root
• From the Management Node, copy the /root/.ssh directory on to the COMPUTE(X), LOGIN and I/O Nodes:
    scp -r .ssh <node_name>:/root
• Restart the SSH service on each type of node:
    service sshd restart

Note: The SSH keys of the users can be restored from the files saved by the administrator, for example <username>/.ssh.

The sudo configuration will have been changed during Bull XHPC software installation to enable administrators and users to use the sudo command with ssh. By default, sudo requires a pseudo-tty system call to be created in order to work, and this is set by the requiretty option in the /etc/sudoers configu
g Munge for clusters which use this authentication type for the SLURM components.

2.7 Deploy the BAS5 for Xeon v1.2 Reference Node Images

2.7.1 Deployment Pre-Requisites
The following pre-requisites should be in place before the new BAS5 for Xeon v1.2 images are created and deployed by Ksis:
• The Ksis Image Server has been installed on the Management Node.
• The node descriptions and administration network details in the cluster database are up to date and correct.
• The cluster database is accessible. This can be checked by running the command ksis list. The result must be 'no data found' or an image list with no error messages.
• All the nodes that will receive a particular image (for example the COMPUTEX image) are hardware equivalent, that is, they use the same NovaScale platform, disks and network interfaces.
• All system files are on local disks and not on the disk subsystem. Before creating an I/O node image, for example, all disk subsystems must be unmounted and disconnected.
• Each node is configured to boot from the network via the eth0 interface. If necessary, edit the BIOS menu and set the Ethernet interface as the primary boot device.
• All the nodes for the deployment are powered on. This can be checked by running the nsctrl command, for example:
    nsctrl status xena[1-100]
Any nodes that are shown as inactive will need to be powered on.
• All the
145. g RHEL5 1 BAS5v1 2 for Xeon Software and optional HPC software products OM IMS ROARS cse e a atrae re i e qat on v repe abe dept ve videte Ande 3 45 3 4 1 Preparenis script prerequisites gcc ecce roe tees UN bate pip edu d HUE 3 45 3 4 2 Preparing the NFS node software installation sssssssseeee 3 45 3 4 3 Launching the NFS Installation of the BAS5v1 2 for Xeon 3 48 STEP 5 Configuring Administration Software on LOGIN I O COMPUTE and COMPUTEX R fsrence wales ORB De n NATI Cor Fe N 3 49 3 5 Configuring SSH and ete cueste Gare Ra etr os ein S Unt Sad 3 49 3 5 2 Configuring rel el oT TTE 3 50 3 5 3 Configuring the kdump kernel dump tool ssssssssssse 3 5 3 5 4 Installing and Configuring SLURM optional 3 52 BASS for Xeon Installation and Configuration Guide 3 5 5 Installing and Configuring the PBS Professional Batch Manager optional 3 56 Boe Season ron ote etit e ctu et aut Cae dit rete 3 58 3 5 7 Intel Math Kernel Library MKL scott Oo vast ore eden 3 58 3 5 8 Configuring the MPI User Environment eerte ee nia Oa eO Ogte ub e trs 3 58 9 5 9 SIAC simia Vp Dares gu aue Span etit uta ecu euet 3 59 3 5 10 NVIDIA Tesla Graphic Card accelerators 3 59 3 5 11 NVIDIA CUDA Toolkit
ge below is displayed:
    Connected
    Switch>

9.1.7.1 Configuring a CISCO Switch
1. Set the enable mode:
    Switch> enable
2. Enter configuration mode:
    Switch# configure terminal
    Enter configuration commands, one per line. End with CNTL/Z.
    Switch(config)#
3. Set the name of the switch, in the form hostname <switch name>. For example:
    Switch(config)# hostname myswitch
    myswitch(config)#
4. Enter the SVI 1 interface configuration mode:
    myswitch(config)# interface vlan 1
    myswitch(config-if)#
5. Assign an IP address to the SVI of VLAN 1, in the form ip address <ip a.b.c.d> <netmask a.b.c.d>:
    myswitch(config-if)# ip address 10.0.0.254 255.0.0.0
    myswitch(config-if)# no shutdown
6. Exit the interface configuration:
    myswitch(config-if)# exit
    myswitch(config)#
7. Set the portfast mode as the default for the spanning tree:
    myswitch(config)# spanning-tree portfast default
Warning: this command enables portfast by default on all interfaces. You should now disable portfast explicitly on switched ports leading to hubs, switches and bridges, as they may create temporary bridging loops.
8. Set a password for the enable mode. For example:
    myswitch(config)# enable password myswitch
9. Set a password for the console port:
    myswitch(config)# line console 0
    myswitch(config-line)# password admin
mys
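The excerpt above stops before the configuration is saved; on Cisco IOS switches the running configuration is normally written to NVRAM once all the settings are in place, typically as follows (shown here as a reminder only, not as part of the numbered procedure above):

    myswitch(config-line)# end
    myswitch# copy running-config startup-config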
ged switch that collects error and bandwidth statistics. It is essential to ensure that it is running with the correct setup.

8.4.1 Performance manager menu
Enter the PM configuration menu as follows:
    ssh enable@switchname
    enable@switchname's password: voltaire
    Welcome to Voltaire Switch switchname
    Connecting ...
    switchname# config
    switchname(config)# pm
    switchname(config-pm)#

8.4.2 Activating the performance manager
Performance Manager is activated as follows:
    switchname(config-pm)# pm mode set enable
Once activated, configure the performance manager to enable reporting:
    switchname(config-pm)# pm report enable
Check that everything is OK by using the pm show command:
    switchname(config-pm)# pm show
The output lists the current settings, including: pm mode, Trap mask, Polling interval, Scope, Reset scope, Counter operation, Symbol error counter threshold, Link error recovery counter threshold, Link downed counter threshold, Port rcv errors threshold, Port rcv remote physical errors threshold, Port rcv switch relay errors threshold, Port xmit discards threshold, Port rcv constraint errors threshold, Port xmit constraint errors threshold, Local link integrity errors threshold, Excessive buffer overrun errors threshold, VL15 dropped threshold, Port rcv data, Port rcv pkts and the Report mode.
guration of the cluster. The command options are shown below.

    Usage: config_ipoib -n <node[a-b,x]> [-d <device>] [-m <netmask>] [-s <suffixe>]

Command options:
    -h, help        print this message
    -n <node>       node to update, pdsh form node[a-b,x] or ssh form root@node
    -d <device>     ip device (default: ib0)
    -m <masque>     ip net mask (default: 255.255.0.0)
    -s <suffixe>    name suffix in /etc/hosts (default: ic0)

In the example below, the command will create the configuration file ifcfg-eth1 on the nodes zeus8 to zeus16, to configure the eth1 interface for these nodes, using the IP addresses listed in the /etc/hosts file for the zeus8-ic1 to zeus16-ic1 interfaces:
    config_ipoib -n zeus[8-16] -d eth1 -m 255.255.0.0 -s ic1

E.2 Interface Description file

Ethernet Adapters
The Ethernet interconnect adapter will be identified by a logical number, using the format eth<1|2>, for example eth1 and eth2. The IP properties (address, netmask, etc.) for the Ethernet adapter are configured using a description file named /etc/sysconfig/network-scripts/ifcfg-eth<1|2>.

InfiniBand Adapters
The InfiniBand interconnect adapter will be identified by a logical number, using the format ib<0|1|2>, for example ib0 and ib1. The IP properties (address, netmask, etc.) for the InfiniBand adapter are configured using a description file named /etc/sysconfig/network-scripts/ifcfg-ib
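For reference, the generated description file for the first InfiniBand interface typically contains entries of the following kind; the address and netmask shown are illustrative only and are normally derived from the /etc/hosts entries as explained above.

    # /etc/sysconfig/network-scripts/ifcfg-ib0
    DEVICE=ib0
    ONBOOT=yes
    BOOTPROTO=static
    IPADDR=172.18.0.8
    NETMASK=255.255.0.0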
gure button.

See: The StoreWay Optima 1250 Quick Start Guide for more information.

Once the network settings are configured, you can start StoreWay Master using a web browser, by entering the explicit IP address assigned to the embedded StoreWay Master server followed by the port number 9292, for example:
    http://<IP_address>:9292

If the default settings (user name: admin, password: password) are changed, then the user name and password settings in the xyradmin and xyrpasswd fields of the /etc/storageadmin/xyr_admin.conf file will have to be updated.

Configure SNMP using the StoreWay Master GUI: firstly select the Settings button and then the SNMP button. If this is the first time that SNMP has been set, you will be asked for the paper licence details that are included with the Optima 1250 storage system. Using the SNMP menu, enter the IP address of the management station and deselect the information level box for this trap entry; leave the warning and error levels checked.

Check that end-to-end access has been correctly set up for the cluster Management Node, using the command below:
    xyr_admin -i <optima_1250_IP_address> -c getstatus all

4.5 Enabling the Administration of EMC CLARiiON (DGC) storage systems

4.5.1 Initial Configuration

See: The appropriate EMC CLARiiON CX3 Series Model XX (40 or 20 or 10c) Setup Guide, delivered with the storage system, for more detail
he same as the tftp server.

13. Disconnect the Foundry Networks Switch
Once the switch configuration has been saved and the Administrator has exited from the interface, it will then be possible to disconnect the serial line which connects the switch to the Linux Management Node.

14. The configuration can be checked as follows.
From the Management Node, run the following command:
    telnet 10.0.0.254
Enter the password when requested. Set the enable mode:
    enable
Enter the password when requested. Display the configuration with the show configuration command. Two examples are shown below.

Model FLS648:
    telnet@myswitch# show configuration
    ! Startup-config data location is flash memory
    ! Startup configuration:
    ver 04.0.00T7e1
    fan-threshold mp speed-3 50 90
    module 1 fls-48-port-copper-base-module
    hostname myswitch
    ip address 10.0.0.254 255.0.0.0

Model RX4:
    telnet@myswitch# show configuration
    ! Startup-config data location is flash memory
    ! Startup configuration:
    ver V2.3.0dT143
    module 1 rx-bi-10g-4-port
    module 2 rx-bi-10g-4-port
    module 3 rx-bi-1g-24-port-copper
    vlan 1 name DEFAULT-VLAN
    router-interface ve 1
    enable telnet password
    enable super-user-password
    logging facility local0
    hostname myswitch
    interface management 1
    ip address 209.157.22.254/24
    interf
151. he Lustre File Systems once wee 3 4 3 0 5 Saving the SLURM Conftqureftotiz s ott RR eR ER ORE DRE qud 3 4 STEP 1 Installing Red Hat Enterprise Linux Software on the Management Node 3 5 3 1 1 Configure Internal RAID discs for BASS for Xeon clusters 3 5 3 1 2 Red Hat Enterprise Linux 5 Installation sssssssse e 3 5 3 1 3 Red Hat Linux Management Node Installation Procedure 3 6 DISSE E 3 0 3 1 5 Network access Cantiqumdlams ox ess fice x exte tee dE E RD MA E 3 13 3 1 6 Time Zone Selection and Root Password sssssssss 3 14 3 1 7 Red Hat Enterprise Linux 5 Package Installation 3 15 SOLEO BSbbonbSeHDS ite anew RE EE ed ba cmm unten E tA ipte 3 17 91 92 Network Gontgutalioiscn esc centi etae ete verto aeria oes rtg noes 3 17 3 1 10 External Storage ott dete rrt regente to tes Ee eset 3 19 STEP 2 Installing BASS for Xeon software on the Management Node 3 20 3 2 1 Preparing the Installation of the Red Hat software 3 20 3 2 2 Preparing the Installation of the BAS5 for Xeon XHPC 321 3 2 3 Preparing the Installation of the BAS5 for Xeon optional software
152. hes DHCP requests are forwarded Management IP address is configured Log warnings are sent to the node service syslog server The switches system clock is synchronized with the NTP server for the node For clusters configured with VLAN Virtual Local Area Network or with the virtual router configuration additional parameters must be defined using the usr lib clustmngt ethswitch tools bin config script Ethernet Switches Initial Configuration CISCO Switches CISCO switches must be reset to the factory settings This is done manually 1 Hardware reinitialization Hold down the mode button located on the left side of the front panel as you reconnect the power cable to the switch BASS for Xeon Installation and Configuration Guide 9 1 6 2 For Catalyst 2940 2950 Series switches release the Mode button after approximately 5 seconds when the Status STAT LED goes out When you release the Mode button the SYST LED blinks amber For Catalyst 2960 2970 Series switches release the Mode button when the SYST LED blinks amber and then turns solid green When you release the Mode button the SYST LED blinks green For Catalyst 3560 3750 Series switches release the Mode button after approximately 15 seconds when the SYST LED turns solid green When you release the Mode button the SYST LED blinks green 2 From a serial or Ethernet connection Enter the following commands switch gt enable Enter the
153. hes configuration saving configuration of eswu0c0O switch saving configuration of eswu0cl switch saving configuration of eswulcl switch saving configuration of eswulcO0 switch saved configuration of eswu0c0O switch saved configuration of eswu0cl switch saved configuration of eswulcl switch saved configuration of eswulcO0 switch save done 8 Checking the configuration of a switch The configuration of a switch is displayed by running the command Configuring Switches and Cards 9 5 le 9 1 6 9 1 6 1 9 6 swtConfig status name lt name_of_switch gt Ethernet Switches Configuration File This file describes the parameters used to generate the switches configuration file A configuration file is supplied with the package as usr lib clustmngt ethswitch tools data cluster network xml The file structure is defined by usr lib clustmngt ethswitch tools data cluster network dtd file The file contains the following parameters lt DOCTYPE cluster network SYSTEM cluster network dtd gt lt cluster network gt lt mode type any gt lt login acl yes gt lt netadmin name admin gt lt vlan id 1 type admin dhcp yes svi yes gt mac address logger yes gt logging start yes level warnings facility localO gt lt ntp start yes gt lt mode gt lt cluster network gt It specifies that Only the workstations of the administration network are allowed to connect to the switc
ice when lustre_util is not used.

2. Verify your network
Lustre requires IP interfaces to be in place. On your Lustre MGS node, make sure that IPOIB is configured if the InfiniBand modules are available.

3. Introduction to the MGS service
MGS is delivered as a service matching the cluster suite layout. The service is located in /etc/init.d/mgs.

    service mgs help
    Usage: mgs {start|stop|restart|status|install|erase|reinstall|clear}

start      Start the MGS service on this node, using the mount lustre command. The mount point is /mnt/srv_lustre/MGS. This returns 0 if successful, or if the MGS service is already running.
stop       Stop the MGS service using the umount lustre command. This returns 0 if successful, or if the MGS service has already stopped.
status     Status of the MGS service, resulting from the mount lustre command. This returns 0 if successful.
restart    Restart the MGS service using the stop and start targets. This returns 0 if successful.
install    Installs the MGS service, if the service is not already installed or running. Creates a folder and file for the loopback device and formats it using mkfs.lustre. The size of the loopback file is 512 MBs. The loopback file name is given by the /etc/lustre/lustre.cfg file, target LUSTRE_MGS_ABSOLUTE_LOOPBACK_FILENAME (default value is /home/lustre/run/mgs_loop).
erase      Erase (remove) the MGS backend using the rm (remove) command on the loopback file.
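Using the targets listed above, a first-time setup of the MGS backend on the MGS node is typically a matter of chaining install, start and status:

    service mgs install     # create and format the 512 MB loopback backend
    service mgs start       # mount it under /mnt/srv_lustre/MGS
    service mgs status      # check that the service is running (returns 0)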
ick on the Next button. Select the keyboard that is used for your system. Click on the Next button.

Figure 3-3. RHEL5 installation number dialog box
The dialog asks for the Installation Number: to install the full set of supported packages included in your subscription, enter the Installation Number. If you are unable to locate the Installation Number, consult http://www.redhat.com/apps/support/in.html. If you skip this step: you may not get access to the full set of packages included in your subscription; it may result in an unsupported/uncertified installation of Red Hat Enterprise Linux; and you will not get software and security updates for packages included in your subscription.
157. iguring the kdump kernel dump tool Kdump will have been enabled during the Red Hat installation on the Management Node see section 3 1 8 1 The following options must be set in the etc kdump conf configuration file a The path and the device partition where the dump will be copied to should be identified by its LABEL dev sdx or UUID label either in the home or directories Examples path var crash ext3 dev volgroup00 logvol00 b The tool to be used to capture the dump must be configured Uncomment the core collector line and add a 1 as shown below core collector makedumpfile c d 1 c indicates the use of compression and d 1 indicates the dump level It is essential use non stripped binary code within the kernel Non stripped binary code is included in the debuginfo RPM kernel debuginfo lt kernel_release gt rpm available from http people redhat com duffy debuginfo index js html This package will install the kernel binary in the usr lib debug lib modules lt kernel_version gt folder Note The size for the dump device must be larger than the memory size if no compression is used Use the command below to launch kdump automatically when the system restarts chkconfig kdump on Installing BASS for Xeon v1 2 Software on the HPC Nodes 3 33 3 3 11 Installing and Configuring SLURM optional b SLURM does not work with the PBS Professional Batch
158. ile In Line 39 replace data source mycluster localhost with data source basename localhost Example data source nova localhost 5 Start gmetad service gmetad start chkconfig level 235 gmetad on 3 3 8 Configuring Syslog ng Syslog Ports Usage 584 udp This port is used by cluster nodes to transmit I O status information to the Management Node It is intentionally chosen as a non standard port This value must be consistent with the value defined in the syslog ng conf file on cluster nodes and this ensured by Bull tools There is no need for action here Modify the syslog ng conf file Modify the etc syslog ng syslog ng conf file as follows adding the IP address Ethernet ethO in the administration network which the server will use for tracking 1 Search for all the lines which contain the SUBSTITUTE string for example Here you HAVE TO SUBSTITUTE ip 127 0 0 1 with the GOOD Inet Address lt Management_Node_IP_address gt Installing BASS for Xeon v1 2 Software on the HPC Nodes 3 3 1 2 Make the changes as explained in the messages 3 substitutions with the alias IP address Restart syslog ng After modifying the configuration files restart the syslog ng service service syslog ng restart 3 3 9 Configuring NTP The Network Time Protocol NTP is used to synchronize the time of a computer client with another server or reference time source This section does not cover time s
159. ill not work BUT the runtime libraries can be used by applications previously compiled with the compiler Installing Intel Tools and Applications 7 3 7 4 BASS for Xeon Installation and Configuration Guide Chapter 8 Installing and Configuring InfiniBand Interconnects The information in this chapter only applies to switches with 3 x version firmware Refer to the Voltaire documentation available on the Bull Voltaire Switches Documentation CD or from www voltaire com if the firmware version is later and or the switch models are different This chapter describes how to install and configure InfiniBand interconnects including Voltaire devices these vary according to the size and type of cluster and Mellanox ConnectX Interface Cards The following topics are described e 8 1 Installing HCA 400 Ex Mellanox ConnectXTM Interface Cards e 8 2 Configuring the Voltaire ISR 9024 Grid Switch 8 3 Configuring Voltaire switches according to the Topology e 8 4 Performance manager PM setup e 8 5 FIP setup e 8 6 The Group menu e 8 7 Verifying the Voltaire Configuration 8 8 Voltaire GridVision Fabric Manager e 8 9 More Information on Voltaire Devices 8 1 Installing HCA 400 ExD and Mellanox ConnectX Interface Cards Note Refer to the safety information prior to performing the installation 1 Ensure that the host is powered down and disconnect the host from its power source 2 Locate th
160. in switchName new switch name Then exit Configuring Voltaire Devices The Voltaire Command Line Interface CLI is used for all the commands required including those for software upgrades and maintenance The Voltaire Fabric Manager VFM provides the InfiniBand fabric management functionality including a colour coded topology map of the fabric indicating the status of the ports and nodes included in the fabric VFM may be used to monitor Voltaire Grid Director ISR 9096 9288 2012 and Voltaire Grid Switch ISR 9024 devices VFM includes Performance Manager PM a tool which is used to debug fabric connectivity by using the builtin procedures and diagnostic tools The Voltaire Device Manager VDM provides a graphical representation of the modules their LEDS and ports for Voltaire Grid Director ISR 9096 9288 2012 and Voltaire Grid Switch ISR 9024 devices It can also be used to monitor and configure device parameters See For more detailed information on configuring the devices updating the firmware the Voltaire CLI commands and management utilities refer to the Voltaire Switch User Manual ISR 9024 ISR 9096 and ISR 9288 2012 Switches manual provided on the Voltaire Switches Documentation CD BASS for Xeon Installation and Configuration Guide 9 4 Installing Additional Ethernet Boards When installing additional Ethernet cards the IP addresses of the Ethernet interfaces may end up by
161. in the cluster management framework A FDA manager server is able to manage up to 32 storage arrays The server and CLI components must be installed on the same system for as long as the cluster contains less than 32 FDA systems The FDA Manager GUI client The GUI client provides an easy to use graphical interface which may be used to configure and diagnose any problems for FDA systems This component is not mandatory for the integration of the FDA in a cluster management framework Note The external Windows station must have access to the FDA manager server The Linux rdesktop command can be used to provide access to the GUI from the cluster Management Node FDA Storage System Management prerequisites e laptop is available and is connected to the maintenance port MNT using an Ethernet cross cable Alternatively a maintenance port of the FDA is connected to a Windows station e electronic license details are available These have to be entered during the initialisation process e Knowledge of installing and configuring FDA storage systems e User manuals for this storage system should be available e The FDA name must be the same as in the disk array table for the ClusterDB and for the iSM server Configuration of Storage Management 4 3 e FDA Manager user name and password have to have been transferred to the respective necadmin and necpasswd fields in the etc storageadmin nec_admin conf
162. ing the nsctrl command nsctrl status ip node name The output must not show nodes in an inactive state meaning that they are not powered on 5 Check the status of the nodes by running the ksis nodelist command ksis nodelist Creating an Image Create an image of the COMPUTE X and LOGIN and I O or combined LOGIN IO reference nodes according to cluster type installed previously ksis create image name reference node name Example ksis create imagel nsl This command will ask for a check level Select the basic level If no level is selected the basic level will be selected automatically by default after the timeout BASS for Xeon Installation and Configuration Guide 3 6 3 3 6 4 3 6 4 1 Deploying the Image on the Cluster Note Before deploying the image it is mandatory that the equipment has been configured see STEP 3 1 Before deploying check the status of the nodes by running the command ksis nodelist ksis nodelist 2 If the status for any of the nodes is different from up then restart Nagios by running the following command from the root prompt on the Management Node service nagios restart 3 Each node must be configured to boot from the network via the ethO interface If necessary edit the BIOS menu and set the Ethernet interface as the primary boot device 4 Start the deployment by running the command ksis deploy image name node 5 If
ithout an existing Serial Interface between the Management Node and the DDNs
Connect the laptop to each serial port and carry out the following operations:
• Set the IP address on the management ports according to the values in the ClusterDB.
• Enable telnet and API services. Set the prompt.
• Configure and enable the syslog service, and transmit the messages to the Cluster Management Node using a specific UDP port (544).
• Configure and enable the SNMP service, with traps directed to the Cluster Management Node.
• Set the date and time.
• Set the admin user and password, and all singlets, according to the values defined in the /etc/storageadmin/ddn_admin.conf file.
• Activate SES on singlet 1.
• Set the tier mapping mode.
• Enable the couplet mode.
• Activate cache coherency.
• Disable cache write back mode.
• Self heal.
• Network gateway.

Notes:
• The laptop has to be connected to each one of the 2 DDN serial ports in turn. This operation then has to be repeated for each DDN storage unit.
• The administrator must explicitly turn on the 8+2 mode on DDN systems where dual parity is required. This operation is not performed by the ddn_init command.
• SATA systems may require specific settings for disks. Consult technical support or refer to the DDN User's Guide for more information.

When the default command has been performed on the system, it is recommended
164. ition Storeway Optima 1250 Storage systems Developed on Fibre Channel standards for server connections and Serial Attached SCSI SAS standards for disk connections the system can support high performance disks and high capacity SAS and SATA disks in the same subsystem 2 x 4Gb s FC host ports per controller with a 3 Gb s SAS channel via SAS and SATA protocol interfaces to the disks EMC CLARIiON DGC Storage systems The CX3 Series models benefit from the high performance cost effective and compact UltraScale architecture They support Fibre Channel connectivity and fit perfectly within SAN infrastructures they offer a complete suite of advanced storage software in particular Navisphere Manager to simplify and automate the management of the storage infrastructure They offer RAID protection levels 0 1 1 0 3 5 and 6 all of which can co exist in the same array to match the different protection requirements of data They also include a write mirrored cache a battery backup for controllers and cache vault disks to ensure data protection in the event of a power failure BASS for Xeon Installation and Configuration Guide The CX3 40f model has 8 GB cache memory 8 x 4 Gb s FC front end ports and 8 x 4 Gb s FC back end disk ports It supports up to 240 drives FC or SATA The CX3 20f model has 4 GB cache memory 12 x 4 Gb s FC frontend ports and 2 x AGb s FC back end disk ports It supports up to 120 drives FC or SATA The CX3
165. its own code and communicating other processes through calls to subroutines of the MPI library Bull provides MPIBull2 Bull s second generation MPI library in the BASS for Xeon delivery This library enables dynamic communication with different device libraries including InfiniBand IB interconnects socket Ethernet IB EIB devices or single machine devices The BASS for Xeon User s Guide for more information on Parallel Libraries Batch schedulers Different possibilities are supported for handling batch jobs for BAS5 for Xeon clusters including PBS Professional a sophisticated scalable robust Batch Manager from Altair Engineering PBS Pro works in conjunction with the MPI libraries PBS Pro does not work with SLURM and should only be installed on clusters which do not use SLURM Cluster Configuration 1 15 1 4 1 4 1 1 16 See The BASS for Xeon User s Guide for more information on Batch schedulers the PBS Professional Administrator s Guide and User s Guide available on the PBS Pro CD ROM delivered for the clusters which use PBS Pro and the PBS Pro web site http www pbsgridworks com Bull BAS5 for Xeon software distribution Installing Software and Configuring Nodes The Node distribution architecture planned for your HPC system Management Nodes Compute Nodes Login Nodes I O Nodes must be known before installing the BAS5 for XEON software Chapter 3 explains how to instal
k access rights.
Check that all the directories listed in the slurm.conf file exist and that they have the correct access rights for the SLURM user. This check must be done on the Management Node, the combined LOGIN/IO or dedicated LOGIN, and COMPUTE(X) Reference Nodes. The files and directories used by SLURMCTLD must have the correct access rights for the SLURM user: the SLURM configuration files must be readable, and the log file directory and state save directory must be writable.

Starting the SLURM Daemons on a Single Node
If, for some reason, an individual node needs to be rebooted, one of the commands below may be used:
    /etc/init.d/slurm start
or
    service slurm start
or
    /etc/init.d/slurm startclean
or
    service slurm startclean
The startclean argument will start the daemon on that node without preserving saved state information (all previously running jobs will be purged and the node state will be restored to the values specified in the configuration file).

More Information

See: The Bull BAS5 for Xeon Administrator's Guide for more information on SLURM (Munge configuration, security, the creation of job credential keys and the slurm.conf file).
See: man slurm.conf for more information on the parameters of the slurm.conf file, and man slurm_setup.sh for information on the SLURM setup script.

3.5.5 Installing and Configuring the PBS Professional Batch Manager
167. ks B 1 Table of Contents ix B 1 1 Migrating Cluster DB Data from BASS for Xeon 1 1 B 1 B 1 2 Migrating Cluster DB Data from BASA for Xeon 1 2 B 1 B 1 3 Migrating Cluster DB Data from BASA for Xeon 1 1 B 2 B 2 Saying and Reinstalling the Cluster DB date tete defert B 2 B2 Saving tine Deter files dnte vede iei ER Oi set io Ee B 2 B 2 2 Reinstalling the Data files s oeste p b ttd e evo B 3 B 3 Initializing the Cluster Database using the preload file B 3 Appendix C Migrating lustre eene eene mener nnne 1 C Migrating Lustre from version 1 4 to version 1 6 oie dice n es C C 1 1 PuegColtiguraiondorMIdEIs serv teeth ette C 1 C 1 2 Installation and Configuration of Lustre version 1 6 x RPMS C2 CLs Post Configuration operdfiofis i ooi ces oet USC re e TREE C2 Appendix D Manually Installing BAS5 for Xeon Additional Software D 1 Appendix E Configuring Interconnect Interfaces E The eonhdaiporbcemmadtid sedare bed iet ott ue GR OM SER bet E 1 2 Interface Description RA 2 E24 Checking theinterfaces ode d RD 2 E 2 2 Starting the InfiniBand interfaces oso ov ater tte aa E 3 Appendix F Binding Services to a Single Network
168. l 55 for Xeon distribution on a Management Node and how to use the Prepare NFS script to install the node function and product RPMs required for each type of node The software installed on the nominated Compute Login or I O Nodes is then used by Ksis a utility for image building and deployment to create a reference image that is deployed throughout the cluster to create other Compute Login or I O Nodes The term Reference Node designates the node from which the reference image is taken BASS for Xeon Installation and Configuration Guide Chapter 2 Updating BASS for Xeon v1 1 clusters to BASS for Xeon v1 2 BASS for Xeon v1 1 clusters can easily be updated to BASS for Xeon v1 2 using the configuration files that are already in place Follow the procedure described in this chapter to carry out this update See The BASS for Xeon v1 2 Software Release Bulletin for details of any restrictions which apply either to the updating procedure or for High Availability A WARNING All activity on the cluster must be stopped before carrying out the updating procedure see the BAS5 for Xeon v1 2 Software Release Bulletin for more information It is customer s responsibility to back up data and their software environment including configuration files before using the procedure described in this chapter For example the etc passwd etc shadow files root ssh directory and the home directory of the user
169. l be installed for example usr pbs 3 56 55 for Xeon Installation and Configuration Guide JUDA 3 9 9 9 Home directory The directory into which the PBS Pro daemon configuration files and log files will be installed for example var spool PBS PBS installation type The installation type depends on the type of node that PBS Professional is being installed on and are as follows the COMPUTE Node type 2 On the Login Node type This has to be a separate dedicated Login Node Do you want to continue Answer Yes You need to specify a hostname for the Server Give the hostname of the node where the PBS server has been installed normally this is the Management Node Would you like to start When the Installation complete window appears the installation program offers to start PBS Professional enter n for no Initial configuration on a COMPUTE X or LOGIN Reference Node See Chapter 3 in the PBS Professional Administrator s Guide for more information on configuring and starting PBS Professional Initial configuration on the COMPUTE X Reference Node 1 Modify the etc pbs conf file for the node as follows PBS EXEC usr pbs PBS HOME var spool PBS PBS START SERVER 0 PBS START MOM 1 PBS START SCHED 0 PBS SERVER server name 0 PBS SCP usr bin scp 2 Start PBS on the Compute Node etc init d pbs start
170. launch dmbConfig Updating BASS for Xeon v1 1 clusters to BASS for Xeon v1 2 2 5 This procedure is not fully supported for this release Contact Bull Technical Support for more information Conman Run the command below from the Management Node to force conman to start for the newly installed BASS for Xeon v1 2 cluster dbmConfig configur restart forc 2 6 Install BAS5 for Xeon v1 2 on the Reference Nodes Install the BAS5 for Xeon v1 2 software on the reference nodes for the cluster Note According to the cluster architecture reference nodes will exist for the following types of nodes COMPUTE or COMPUTEX LOGIN IO or LOGIN and I O 1 Mount NFS from the release directory on the Management Node to the release directory on the Reference Node ssh Reference Node mount t nfs Management Node IP release releas 2 Goto release XBAS5V1 2 directory cd release XBAS5V1 2 3 Execute the install command install 4 Confirm all the installation options that appear 5 Optional for clusters which use SLURM Note Check that the operations described in Section 2 2 1 have been carried out before starting the installation and configuration of SLURM 6 Install and configure SLURM on the Reference Nodes as described in STEP 5 in Chapter Installing BASS for Xeon v1 2 Software on the HPC Nodes See STEP 5 above for details on installin
linux-i586.rpm
    XHPC BONUS: firefox-<Bull_version>-<release>.i386.rpm
b. These are installed by running the commands below:
    cd /release/XBAS5V1.2/XHPC/BONUS
    rpm -i jre-<version>-linux-i586.rpm firefox-<Bull_version>-<release>.i386.rpm
c. Declare the java plugin for the newly installed Firefox version, as follows:
    cd /usr/lib/firefox-<Bull_version>/plugins
    ln -s /usr/java/latest/plugin/i386/ns7/libjavaplugin_oji.so
d. Restart the /usr/bin/firefox-<Bull_version> browser to take the java plugin into account.

4.5.2 Complementary Configuration Tasks for EMC CLARiiON CX series storage devices
The disk array is configured via the Navisphere Manager interface, in a web browser, using the following URLs:
    http://<SPA_ip_address> or http://<SPB_ip_address>
1. Set the disk array name by selecting the disk array and opening the properties tab.
2. Set the security parameters by selecting the disk array and then selecting the following option in the menu bar: Tools > Security > User Management. Add a username and a role for the administrator.
3. Set the monitoring parameters as follows:
a. Using the Monitors tab, create a Monitoring template with the following parameters:
    General tab:
        Events - General: Event Severity: Warning, Error, Critical
        Event Category: Basic Array Feature Events
    SNMP tab:
        SNMP Management Host
lusterdb/install/preload_xxxx.sql (xxxx in the path above corresponds to your cluster).
2. Save the database:
    pg_dump -Fc -C -f /var/lib/pgsql/backups/clusterdb.dmp clusterdb

Re-installation of BAS5 for Xeon v1.2 with ClusterDB Preservation

Note: This paragraph applies when re-installing an existing version of BAS5 for Xeon v1.2 with the restoration of the existing Cluster Database.

1. Run the commands:
    su - postgres
    psql -U clusterdb clusterdb
    Enter Password:
    clusterdb=> truncate config_candidate;
    TRUNCATE TABLE
    clusterdb=> truncate config_status;
    TRUNCATE TABLE
    clusterdb=> \q

2. Restore the Cluster DB files which have been stored under /var/lib/pgsql/backups:
    pg_restore -Fc --disable-triggers -d clusterdb /var/lib/pgsql/backups/<name_of_ClusterDB_saved_file>
For example, <name_of_ClusterDB_saved_file> might be clusterdbdata-2006-1105.sav.

See: Section 3.0.1 Saving the ClusterDB for details of the Cluster database files that have been saved, and the BAS5 for Xeon Administrator's Guide for more details about restoring data.

3. Go back to root by running the exit command.

STEP 3: Configuring Equipment and Installing Utilities on the Management Node
This step describes how to:
• Configure equipment
• Configure Ethernet switches
• Install and configure Ganglia, Syslog-ng, NTP, Postfi
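After the restore described above, a simple sanity check is to reconnect to the database and count the rows of a central table such as node; the query below is only an illustration and assumes the standard clusterdb account used above.

    su - postgres
    psql -U clusterdb clusterdb -c "select count(*) from node;"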
173. mpiler for Intel 64 architecture formerly Intel EM64T Installation Follow the instructions contained in the Bull notice which is supplied with the Intel compiler provided by Bull and use the default path proposed by the installation routine C C Compiler for Intel 64 architecture formerly Intel EM64T Installation Follow the instructions contained in the Bull notice which is supplied with the Intel compiler provided by Bull and use the default path proposed during the installation routine Installing Intel Tools and Applications 7 1 7 3 Intel Debugger The package used to install the Intel debugger is located in either the Fortran or C tar archive Installation Follow the instructions contained in the Bull notice which is supplied with the Intel compiler provided by Bull 7 4 intel Math Kernel Library The Intel MKL libraries must be installed on Compute Extended Compute and Login Nodes Installation An installation notice is supplied with the Intel MKL provided by Bull 75 Intel Trace Tool Intel Trace Tool is supplied directly by Intel to the customer Intel Trace Tool uses the FlexLM license scheme The recommended path for installation is opt intel itac lt rel number1 gt Install it as follows cd tmp tar zxvf l_itac_ lt rel number 2 gt tar gz rel number 1 and rel number 2 represent the release numbers of the product e the installation command install sh
n:
    fast interface show
4. To display the system clock:
    clock show
5. To check the hardware, including serial numbers, etc. (vital product data):
    vpd show

8.8 Voltaire GridVision Fabric Manager
For details of configuring routing using the GridVision Fabric Manager GUI, see section 12.6 in the Voltaire GridVision Integrated User Manual for Grid Directors ISR 9096 and ISR 9288 and the Grid Switch ISR 9024. This is included on the Voltaire documentation CD provided.

8.9 More Information on Voltaire Devices

See: The manuals available on the Bull Voltaire Switches Documentation CD, or from www.voltaire.com, for more information regarding other switch models and for switches which use a firmware later than version 3.x.

Note: For more information on the SLURM Resource Manager, used in conjunction with InfiniBand stacks and Voltaire switches, see the BAS5 for Xeon Administrator's Guide and the BAS5 for Xeon User's Guide.

Chapter 9. Configuring Switches and Cards
This chapter describes how to configure BAS5 for Xeon switches and cards. The following topics are described:
• 9.1 Configuring Ethernet Switches
• 9.2 Configuring a Brocade Switch
• 9.3 Configuring Voltaire Devices
• 9.4 Installing Additional Ethernet Boards

9.1 Configuring Ethernet Switches
The Ethernet switches are configured automatically, using the ClusterDB database inf
175. n software products required The products to be installed for the cluster must be listed after prod option as shown in the example below In this example all the software products will be installed cd release XBAS5V1 1 install prod XIB XLUSTRE XTOOLKIT o Lustre must use dedicated service nodes for I O functions and NOT combined Login IO service nodes NFS can be used on both dedicated O service nodes and on combined Login IO service nodes See The Bull BAS5 for Xeon Application Tunings Guide for details on configuring and using HPC Toolkit Manually Installing BAS5 for Xeon Additional Software D 1 D 2 BASS for Xeon Installation and Configuration Guide Appendix E Configuring Interconnect Interfaces First installation or reinstallation of BASS for Xeon v1 2 The configuration of the InfiniBand Interconnect interfaces is carried out automatically by Ksis when the images of the Compute and Login IO nodes are deployed Update from BAS5 for Xeon v1 1 to BAS5 for Xeon v1 2 The config ipoib command is used to configure both InfiniBand and Ethernet interfaces 1 The config_ipoib command The interconnect interface description file is generated from the Management Node for each node by using the config_ipoib command The interfaces parameters are obtained from the etc hosts file on the Management Node Different options have to be set for the config_ipoib command according to the confi
176. nagement 4 1 4 1 Enabling Storage Management Services Carry out these steps on the Management Node 1 Configure ClusterDB access information The ClusterDB access information is retrieved from the etc clustmngt clusterdb clusterdb cfg file 2 Edit the etc cron d storcheck cron file to modify the period for regular checks of the status for storage devices This will allow a periodic refresh of status info by pooling storage arrays Four 4 hours is a recommended value for clusters with tens of storage systems For smaller clusters it is possible to reduce the refresh periodicity to one 1 hour 0 2 root usr bin storcheck gt var log storcheck log 2 gt amp 1 4 2 BASS for Xeon Installation and Configuration Guide 4 2 Enabling FDA Storage System Management cont This section only applies when installing for the first time See The Bull FDA User s Guide and Maintenance Guide specific to the Store Way FDA model that is being installed and configured The management of FDA storage arrays requires an interaction with the FDA software delivered on the CDs provided with the storage arrays The Cluster management software installed on the cluster Management Node checks the FDA management software status Several options are available regarding the installation of this FDA software The FDA manager server and CLI These two components are mandatory for the integration of FDA monitoring
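For reference, an /etc/cron.d/storcheck.cron entry implementing the recommended four-hour check period would typically look like the line below (the log file path matches the one shown above):
0 */4 * * * root /usr/bin/storcheck > /var/log/storcheck.log 2>&1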
177. nd Configuring InfiniBand Interconnects 8 5 switchname config sm 8 2 11 Configuring Passwords Use the following procedure for configuring passwords for Exec and Privileged mode access to the RS 232 console interface and to the Ethernet management interface used for establishing a CLI session via Telnet see section 8 2 3 Starting a CLI Management Session via Telnet Note default password for Privileged mode is 123456 and for Exec mode is voltaire 1 Enter Privileged mode from Exec mode enable password 2 Set the Privileged and Exec mode passwords password update admin enable 3 Exit Privileged mode exit 8 6 BASS for Xeon Installation and Configuration Guide 8 3 Configuring Voltaire switches according to the Topology It is essential that the topology settings for the Voltaire switches are correct otherwise the performance of the cluster will suffer InfiniBand networks support 3 stage CLOS and 5 stage CLOS topologies e the network consists of a single ISR9024 DM Voltaire switch then it is not a CLOS network The topology parameter is not taken into account in this case e Ifthe network only uses ISR9024 DM Voltaire switches the topology is most likely CLOS 3 While it is technically feasible to build a CLOS 5 network using these switches it does not make much sense economically e the network only uses ISR9096 DM ISR9288 DM ISR2012 DM chassis
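As an illustration of the password procedure in section 8.2.11 above, a complete session might look like the following sketch (the default Privileged mode password is shown; the command prompts for the new values, which are site-specific):
switchname> enable
Password: 123456
switchname# password update admin enable
switchname# exit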
178. needs to be purchased separately to install the BASS for Xeon HPC Toolkit software For example use the command below to install the MNGT IO LOGIN functions with the InfiniBand software install func MNGT IO LOGIN prod XIB The install script installs the software which has been copied previously into the release directory on the NFS server hpcviewer for HPC Toolkit IF HPC Toolkit has been installed and you wish to use the hpcviewer tool on the Management Node carry out the following procedure The Java virtual machine rpm included on the RHEL5 1 Supplementary for EM 4T CDROM must be installed so that the hpcviewer tool included in HPC Toolkit can function This is done as follows 1 Goto the release RHEL5 1 Supplementary directory cd release RHEL5 1 Supplementary Installing BASS for Xeon v1 2 Software on the HPC Nodes 3 23 3 2 9 3 2 5 1 3 24 2 Manually install the public key for verification of the Java virtual machine RPM by using the command below rpm import RPM GPG KEY redhat releas 3 Install the Java virtual machine by running a command similar to the one below yum install java virtual machine version For example yum install java 1 5 0 bea 1 5 0 08 13jpp 5 e15 See The Bull BAS5 for Xeon Application Tuning Guide for details on configuring and using Toolkit Database Configuration Please go to the
179. ng the Foundry Network switches initially with the IP address 192 168 1 200 24 or for a temporary configuration of an Ethernet switch Cisco or Foundry Pre Requisites Before an Ethernet switch can be configured ensure that the following information is available The name of the switch IP address of the switch The IP address of the Netmask Passwords for the console port and the enable mode These must be consistent with the passwords stored in the ClusterDB database 1 Connect the Console port of the switch to the Linux machine Using a serial cable connect a free serial port on a Linux machine to the CONSOLE port of the switch Make a note of the serial port number as this will be needed later 2 From the Linux machine establish a connection with the switch Connect as root Open a terminal nthe etc inittab file comment the tty lines that enable a connection via the serial port s these lines contain ttySO and ttyS1 S0 2345 respawn sbin agetty 115200 ttyso S1 2345 respawn sbin agetty 115200 ttyS1 Run the command kill 1 1 BASS for Xeon Installation and Configuration Guide Connect using one of the commands below the serial cable connects using port then run cu s 9600 1 dev ttySO fthe serial cable connects using port 1 then run cu s 9600 1 dev ttyS1 Enter no to any questions which may appear until the following messa
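To recap steps 1 and 2 as a single sketch, assuming the serial cable is plugged into the first port (use ttyS1 otherwise):
# comment out the ttyS0 and ttyS1 agetty lines in /etc/inittab, then signal init
kill -1 1
# open the console connection to the switch
cu -s 9600 -l /dev/ttyS0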
180. ng up NIS to share user accounts 6 1 6 1 1 Configure NIS on the Login Node NIS server 6 1 6 1 2 Configure NIS on the Compute or and the I O Nodes NIS 6 2 6 2 Configuring NFS to share the home nfs release 6 3 6 2 1 Preparing the LOGIN node NFS server for the NFSv3 file 6 3 6 2 2 Setup for NFS v3 file 6 4 6 3 Configuring the Lustre file sy det eles at uns eto cive teres aoa awe UU 6 5 6 3 1 Enabling Lustre Management Services on the Management 6 5 6 3 2 Configuring d O Resources for Bistro rhe poet eno ter sopra e pes 6 6 6 3 3 Adding Information to the etc lustre storage conf file 6 7 6 3 4 Configuring the High Availability services Lustre High Availability clusters only 6 8 6 3 5 Lustre Pre Configuration Oper alonso rip edet 6 8 6 3 6 Configuring the Lustre MGS service ener 6 9 6 3 7 Lustre Pre Configuration Checks iiiter ete E Ute iris eU e iren ioter 6 11 6 3 8 Configuring ecsetseset rone te pte e 6 12 Chapter 7 Installing Intel Tools and 7 1 7 1 Deleted cebat sos deret RATE de ode mese lieet
181. nodes for the deployment must be up This can be checked using the command below from the Management Node ksis nodelist If the status for any of the nodes is different from up then restart Nagios by running the following command from the root prompt on the Management Node service nagios restart For BASS for Xeon v1 1 clusters with High Availability for NFS 1 In the etc modprobe conf file add the options Ipfc Ipfc nodev 5 line before the lines below install lpfc modprobe i lpfc logger p local7 info t IOCMDSTAT LOAD lpfc remove lpfc logger p local7 info t IOCMDSTAT UNLOAD lpfc modprobe ir lpfc Updating BAS5 for Xeon v1 1 clusters to BASS for Xeon v1 2 2 7 2 Identify the kernel version installed on the node by running the command uname r 3 Save the old initrd image using the kernel version identified above mv boot initrd kernel version img boot initrd kernel version img orig 4 Generate a new initrd image mkinitrd v boot initrd kernel version img kernel version 2 7 2 Create an Image Create an image of each BASS for Xeon v1 2 reference node ksis create lt image_name gt lt reference_ node_name gt Example ksis create imagel nsl This command will ask for a check level Select the basic level If no level is selected the basic level will be selected automatically by default after the timeout 2 7 3 Deploy the Image
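The deployment itself is then launched from the Management Node, reusing the image created above. A typical invocation (the node list is illustrative, and the exact syntax should be checked in the ksis documentation) would be:
ksis deploy image1 ns[2-10]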
182. not already exist For Job completion JobCompType jobcomp filetxt default is jobcomp none JobCompLoc var log slurm slurm job log For accounting type for SLURM v1 0 15 use JobAcct Type jobacct linux default is jobacct none JobAcctLogFile var log slurm slurm_acct log For accounting type for SLURM v1 3 2 use JobAcctGatherType jobacct linux default is jobacct none AccountingStorageLoc var log slurm slurm_acct log Uncomment the appropriate lines if job accounting is to be included Provide the paths to the job credential keys The keys must be copied to all of the nodes JobCredentialPrivateKey etc slurm private key JobCredentialPublicCertificate etc slurm public key Provide Compute Node details Installing BASS for Xeon v1 2 Software on the HPC Nodes 3 35 3 36 NodeName bali 10 37 Procs 8 State UNKNOWN 9 Provide information about the partitions MaxTime is the maximum walltime limit for any job in minutes The state of the partition may be UP or DOWN PartitionName global Nodes bali 10 37 State UP Default YES PartitionName test Nodes bali 10 20 State UP MaxTime UNLIMITED PartitionName debug Nodes bali 21 30 State UP 10 In order that Nagios monitoring is enabled inside NovaScale Master HPC Edition the SLURM Event Handler mechanism has to be active This means that the following line in the SLURM conf file on the Management Node has to be uncommented or added if it does not appear there
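Assembled into one place, and taking the SLURM v1.3.2 accounting variant, the settings described in the steps above would appear in slurm.conf roughly as follows (the bali node names, the partition layout and the file names are simply the examples quoted above):
JobCompType=jobcomp/filetxt
JobCompLoc=/var/log/slurm/slurm.job.log
JobAcctGatherType=jobacct/linux
AccountingStorageLoc=/var/log/slurm/slurm_acct.log
JobCredentialPrivateKey=/etc/slurm/private.key
JobCredentialPublicCertificate=/etc/slurm/public.key
NodeName=bali[10-37] Procs=8 State=UNKNOWN
PartitionName=global Nodes=bali[10-37] State=UP Default=YES
PartitionName=test Nodes=bali[10-20] State=UP MaxTime=UNLIMITED
PartitionName=debug Nodes=bali[21-30] State=UP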
183. ntical on all nodes These directories must be restored once the installation has finished see 3 5 1 Configuring SSH 2 0 3 Saving the Storage Configuration Information The following configuration files in the etc storageadmin directory of the Management Node are used by the storage management tools It is strongly recommended that these files are saved onto a non formattable media as they are not saved automatically for a re installation e storframework conf configured for traces etc e slornode conf configured for traces etc e nec admin conf configured for any FDA disk array administration access e admin conf configured for any DDN disk array administration access admin conf configured for any OPTIMA 1250 disk array administration access e dgc_admin conf configured for any EMC Clariion DGC disk array administration access Also save the storage configuration models if any used to configure the disk arrays Their location will have been defined by the user 3 0 4 Saving the Lustre File Systems The following files are used by the Lustre system administration framework It is strongly recommended that these files are saved onto a non formattable media from the Management Node e Configuration files etc lustre directory e File system configuration models user defined location by default etc lustre models e directory if the High Availability capability is enabled var lib ldap l
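As a sketch, the two configuration directories mentioned in these sections can be archived in one step before the re-installation, for example onto an external backup medium mounted on /media/backup (an assumed mount point):
tar czvf /media/backup/storage_lustre_conf.tar.gz /etc/storageadmin /etc/lustre
The storage configuration models and, where High Availability is enabled, the LDAP directory mentioned above should be added to the archive as well.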
184. o test nsctrl run a command similar to that below root xena0 nsctrl status 1 5 This will give output similar to that below xena2 Chassis Power is on xenal Chassis Power is on xena3 Chassis Power is on 5 Chassis Power is xena4 Chassis Power is on root xena0 2456 Checking Conman coman is a command that allows administrators to connect to the node consoles Installing BAS5 for Xeon v1 2 Software on the HPC Nodes 3 69 Usage conman OPTIONS CONSOLES It runs via the conmand deamon and the dbmConfig command is used to configure it 1 Run the command below to check the conmand demaon root xena0 service conman status conmand pid 5943 is running root xena0 2 Runacommand similar to the one below to check conman root xena0 conman 2 lt ConMan gt Connection to console xena2 opened Red Hat Enterprise Linux Server release 5 1 Tikanga Kernel 2 6 18 53 1 21 e15 Bull 1 on an x86_64 xena2 login dau Testing PBS Professional Basic setup 1 A user should be created on all the nodes for testing purposes in the examples below this is referred to as user test This is done as follows useradd g group d home login 2 The ssh keys for the user should have been dispatched to all nodes normally this will have done at STEP 5 during the installation procedure see section 3 5 1 2 for more information 3 Launch a test job f
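For reference, the user-creation command in step 1 above normally takes a form similar to the following (the group name and home directory are placeholders to be replaced with site-specific values):
useradd -g <group> -d <home_directory> <user_test>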
185. ode distribution in terms of Management Nodes Compute Nodes Login Nodes I O Nodes etc before beginning any software installation and configuration operations 1 2 1 A BASS for Xeon cluster infrastructure consists of Service Nodes for management storage and software development services and Compute Nodes for intensive calculation operations BAS5 for Xeon Cluster architecture The BASS for Xeon clusters feature nodes that are dedicated to specific activities Cluster Configuration 1 1 Service Nodes are configured to run the cluster services The cluster services supported by BASS for Xeon are Cluster Management including installation configuration changes general administration and monitoring of all the hardware in the cluster Login to provide access to the cluster and a specific software development environment O to transfer data to and from storage units using a powerful shared file system service either NFS or Lustre ordered as an option Depending on the size and the type of cluster a single Service Node will cover all the Management Login and I O Node functions OR there will be several Service Nodes providing the different functions as shown in the diagrams that follow Compute Nodes are optimized for code execution limited daemons run on them These nodes are not used for saving data but instead transfer data to Service Nodes There are two types of Compute Nodes possible for Bull BAS5 for Xe
186. ode etc Hardware Management Bull Advanced Server for Xeon software suite includes different hardware management and maintenance tools that enable the operation and the monitoring of the cluster including ConMan a console management program designed to support a large number of console devices and users connected at the same time It supports local serial devices and remote terminal servers via the telnet protocol and can also use Serial over LAN via the IPMI protocol Consoles when managed by ConMan provide e Access to the firmware shell BIOS to obtain and modify NvRAM information to choose the boot parameters for the kernel for example the disk on which the node boots e Visualization of the BIOS operations for a console including boot monitoring e Boot interventions including interactive file system check fsck at boot NS Commands these may be used to configure starting and stopping operations for cluster components These commands interact with the nodes using the LAN administration network to invoke IPMI_tools and are described in the NovaScale Master Remote HW Management CLI Reference Manual Ksis is used to create and deploy software images Bull NovaScale Master HPC Edition provides all the monitoring functions for BAS5 for Xeon clusters using Nagios an open source application for monitoring the status of all the cluster s components that will trigger an alert if there is a problem NovaScale Master use
187. ode that the image has been deployed to ensuring that all the cluster nodes of a particular type are homogenous Ksis post configuration is carried out by running the command ksis postconfig run PostConfig cluster name nodelist For example ksis postconfig run PostConfig xena 1 100 Configure the Interconnect Interfaces Ethernet clusters only The interconnect Ethernet interface description file is generated from the Management Node for each node by using the config ipoib command See Appendix E Configuring Interconnect Interfaces for more details regarding the use of the config ipoib command BASS for Xeon Installation and Configuration Guide STEP 7 Final Cluster Checks Testing pdsh pdsh is a utility that runs commands in parallel on all the nodes or on a group of nodes for a cluster This is tested as follows All nodes 1 Run a command similar to that below from the Management Node as root pdsh w nova 8 10 hostname 2 This will give output similar to that below noval0 novalO nova9 nova9 nova8 nova8 Groups of nodes 1 Run the dbmGroup command dbmGroup show 2 This will give output similar to that below Group Name Description Nodes Name ADMIN Nodes by type ADMIN nova 0 12 ALL All nodes except node admin nova 1 10 Burning Burning group 5 Nodes by type COMP nova 1 4 7 8 COMP128GB COMPUTE node with 128GB nova8 COMP 48
188. oer out 7 1 7 2 Intel Compilers qc tcu ua detta osa ied ert bes tuit is 7 1 7 2 1 Fortran Compiler for Intel 64 architecture formerly Intel 7 1 7 2 2 Compiler for Intel 64 architecture formerly Intel EM64T 7 1 7 3 Intel Debugger ciae attt usate oA ead de etos 7 2 7 4 Intel Math Kernel Library MKL Sc saco ERROR ERREUR Pe T ted 7 2 7 5 Intel Trace ei deat eee So anteater fetus 7 2 7 6 Updating Intel Compilers and BASS for Xeon v1 2 7 3 Chapter 8 Installing and Configuring InfiniBand Interconnects 8 1 8 1 Installing 400 and Mellanox ConnectX Interface 8 1 8 2 Configuring the Voltaire ISR 9024 Grid SwIfel oto iret e ev bas teh 8 2 B 19 Connecting toa Cons les rescues sie us ia br aid bdo eden table be 82 8 2 2 Starting a CLI Management Session using a serial line 8 2 viii BASS for Xeon Installation and Configuration Guide 8 2 3 Starting a CLI Management Session via Telnet 8 2 8 2 4 Configuring the Time and Dies sse doce einem crt sy de sont Ebenen ten oi 8 3 8 2 5 4 Hostname setup tek ui ang eU ERG 8 3 8205 Networking Setup stacy adware bebat aes 8
189. on e Minimal Compute or COMPUTE Nodes which include minimal functionality are quicker and easier to deploy and require less disk space for their installation These are ideal for clusters which work on data files non graphical environment e Extended Compute or COMPUTEX Nodes which include additional libraries and require more disk space for their installation These are used for applications that require a graphical environment X Windows and also for most ISV applications They are also installed if there is a need for Intel Cluster Ready compliance 1 2 2 Different architectures possible for BAS5 for Xeon 12 21 Small Clusters On small clusters all the cluster services Management Login and I O run on a single Service Node as shown in Figure 1 1 1 2 BASS for Xeon Installation and Configuration Guide Service Node Compute Nodes InfiniBand gt or gt 2 Croacie Ethernet e Interconnects 1 Management Login IO Node Figure 1 1 Small Cluster Architecture 1222 Medium sized Clusters On medium sized clusters one Service Node will run the cluster management services and a separate Service Node will be used to run the Login and I O services Service Nodes Compute Nodes e_ nea d InfiniBand or d Management Node Ethernet a Interconnects Login IO Node Figure 1 2 Medium sized Cluster Architecture Cluster Configuration 1 3 1 2 2 3 Large clusters On large cluste
190. onality 5 Configuring the MPI User environment Installing RHEL5 1 BAS5v1 2 for Xeon Software and optional HPC software products on other nodes 1 Specifying the software and the nodes to be installed Page 3 42 2 Running the preparenfs script Configuring Administration Software on Login 1 O COMPUTE and COMPUTEX Nodes 1 Installing and configuring ssh Kdump SLURM and PBS Pro as necessary f 3 49 2 Installing compilers on Login Nodes 3 Configuring the MPI User environment A Installing RAID monitoring software optional Creating an image and deploying it on the cluster nodes using Ksis 1 Installation and configuration of the image server 2 Creation of the image of a COMPUTE X or LOGIN Node previously Page 3 61 installed 3 Deployment of this image on cluster nodes STEP 7 Final Cluster Checks Page 3 65 3 2 BASS for Xeon Installation and Configuration Guide 3 0 Pre installation Backup Operations when Re installing BAS5 for Xeon v1 2 This step describes how to save the ClusterDB database and other important configuration files Use this step only when re installing BAS5 for Xeon v1 2 where the cluster has already been configured or partially configured and there is the need to save and reuse the existing configuration files Skip this step when installing for the first time WARNING The Operating System will be installed from scratch erasing all disk contents in the process
191. onfiguring FDA Access Information from the Management Node 1 Obtain the Linux or Windows host user account and the iSM client user and password which have been defined All the FDA arrays should be manageable using a single login password 2 Edit the etc storageadmin nec_admin conf file and set the correct values for the parameters On Linux iSMpath opt iSMSMC bin iSMcmd On Windows iSMpath cygdrive c Program Files FDA iSMSM_CMD bin iSMcmd iSMpath opt iSMSMC bin iSMcmd iSMpath cygdrive c Program Files FDA iSMSM_CMD bin iSMcmd NEC iStorage Manager host Administrator hostadm administrator EC iStorage Manager administrator login necadmin admin EC iStorage Manager administrator password necpasswd password 4 2 3 Initializing the FDA Storage System 1 Initialise the storage system using the maintenance port MNT The initial setting must be done through the Ethernet maintenance port MNT using the Internet Explorer browser Refer to the documentation provided with the FDA storage system to perform the initial configuration 4 6 BASS for Xeon Installation and Configuration Guide E bs The IP addresses of the Ethernet management LAN ports must be set according to the values predefined in the ClusterDB storstat d n fda name i H Carry out the following post configuration operations using the iSM GUI Start the iSM GUI and verify that the FDA has been dis
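Put together, and on a Linux Management Node, the nec_admin.conf parameters discussed above typically end up looking like the excerpt below (the key/value layout should be checked against the comments in the file itself; the values shown are the defaults quoted above):
iSMpath = /opt/iSMSMC/bin/iSMcmd
hostadm = administrator
necadmin = admin
necpasswd = password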
192. or singlet 2 lt ddn_name gt ddn name sl ddn name s2 Console name for singlet 1 ddn name s1s Console name for singlet 2 lt name s2s IP names and associated IP address are automatically generated in the etc hosts directory The conman consoles are automatically generated in the etc conman conf file Otherwise refer to the dbmConfig command Initializing the DDN Storage System Initialize each DDN storage system either from the cluster Management Node or from a laptop as described below Initialization from a Cluster Management Node with an existing Serial Interface between the Management Node and the DDNs Check that ConMan is properly configured to access the serial ports of each singlet conman lt console name for the singlet gt When you hit return a prompt should appear ddn_init command The ddn_init command has to be run for each DDN The target DDN system must be up and running with 2 singlets operational The serial network and the Ethernet network must be properly cabled and configured with ConMan running correctly to enable access to both serial and Ethernet ports on each singlet Notes e ddn init command is not mandatory to configure DDN storage units The same configuration can be achieved via other means such as the use of DDN CLI ddn_admin or DDN telnet facilities to configure other items e init comman
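As an illustration of the check and the initialization described above, for a DDN named ddn0 (an example name, following the <ddn_name>_s1s console naming convention), the sequence from the Management Node might be:
conman ddn0_s1s
ddn_init ddn0
A prompt should appear on the console when Return is pressed; the exact arguments accepted by ddn_init should be checked against its usage message.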
193. or root admin password The same logins are defined in the etc storageadmin ddn_admin conf file iSM iSM Change to admin and password to NEC Storage match logins defined in the subsystems etc storageadmin nec_admin conf file admin password The same logins are defined in the pons ds 299 etc storageadmin xyr_admin conf orage subsystems file EMC DGC User defined User defined It is recommended to use admin and ADF Storage systems at the first atthe first password in the same way as for ge sy connection connection other systems Default Logins for different cluster elements A 1 A 2 BASS for Xeon Installation and Configuration Guide Appendix B Cluster Database Operations B 1 Migrating to 55 for Xeon v1 2 B 1 1 Migrating Cluster DB Data from BASS for Xeon v1 1 The Cluster Database data will be migrated automatically when an existing BASS for Xeon v1 1 cluster is upgraded to BASS for Xeon v1 2 See Chapter 2 Updating BASS for Xeon v1 1 clusters to BASS for Xeon v1 2 in this manual for a description of the upgrade procedure for clusters without any form of High Availability B 1 2 Migrating Cluster DB Data from BASA for Xeon v1 2 AN WARNING All working activity for the cluster must have been stopped before migrating the cluster database on the Management Node 1 Log on as root on the BAS4 for Xeon v1 2 Management Node and install the clusterdb data ANYA 12 20 4 1
194. ormation and the configuration file see section 9 1 5 Ethernet Switches Configuration File Prerequisites e Management Node must be installed In particular the Ethernet interface of the Administration Network and its alias must be configured and the netdisco package installed e ClusterDB database must be preloaded and reachable e CISCO switches must remain as configured initially factory settings Foundry Network switches must have the default IP address preinstalled see section 9 1 6 Ethernet Switches Initial Configuration 9 1 1 Ethernet Installation scripts The tool is supplied in the form of a RPM package ethswitch tools1 0 0 Bull noarch rpm on the Cluster Management CD It should be installed on the Management Node This package includes the following scripts usr sbin swtAdmin The main script used to install switches usr sbin swtConfig A script that enables configuration commands to be run on the switches Also the package includes the usr lib clustmngt ethswitch tools directory which contains the following directories bin Perl scripts called by the swtAdmin main script lib libraries required to execute the scripts data The configuration file and DTD files Configuring Switches and Cards 9 9 1 2 2 3 9 2 swtAdmin Command Option Details usr sbin swtAdmin auto step by step generatel preinstall Example netdisco mac update install save clear switch numb
195. ory before continuing cp r etc lustre somewhere on the management node lustre bkp Migrating Lustre C 1 WARNING The directory where these backup files are copied to must not be lost when the Management Node is reinstalled T Installation and Configuration of Lustre version 1 6 x RPMS 1 Mount NFS from the release directory on the Management Node to the release directory on the Lustre Service Node ssh Service Node mount t nfs Management Node IP release releas 2 Install the XLustre software as shown below cd release XBAS5V1 2 install prod XLUSTRE 3 Configure the new version of Lustre as detailed in the Configuring Lustre section in Chapter 6 in the BASS for Xeon Installation and Configuration Guide gt Stop at the Install the file system step as the Lustre configuration details and data will have been saved previously C 1 3 Post Configuration operations 1 After installing the Lustre version 1 6 packages copy the contents of the backed up lustre bkp directory into etc lustre cp r somewhere on the management node lustre bkp etc lustre 2 Check the new lustre cfg file contains the new MGS related directives i e LUSTRE MGS HOST LUSTRE MGS NET LUSTRE MGS ABSOLUTE LOOPBACK FILENAME 3 From the Management Node run the clean extents on dirs sh script on all Lustre file systems to remove version 1 4 extents these are not supporte
196. ostname. It can be checked as follows:
switchname(config-names)# system-name show
Networking Setup
The following section describes how to set up the switch IP address for the Ethernet interface. The configuration of the IP address over the InfiniBand network is not described.
Installing and Configuring InfiniBand Interconnects 8-3
Note: The default IP address for a Voltaire switch is 192.168.1.2. If the switch cannot be reached using this address, then use a serial line (speed 38400, no parity, 1 stop bit, no flow control).
8.2.5.1 Networking configuration menu
Enter the networking configuration menu as follows:
ssh enable@switchname
enable@switchname's password: voltaire
Welcome to Voltaire Switch switchname
Connecting
switchname# config
switchname(config)# interface fast
switchname(config-if-fast)#
8.2.6.2 Determining the current IP setup
The IP address that is currently configured can be seen as follows:
switchname(config-if-fast)# ip-address-fast show
fast interface ip is 172.20.2.20
ip mask is 255.255.0.0
broadcast ip is 172.20.255.255
management interface is eth1
link speed is auto-negotiation
The DHCP client is disabled
8.2.7 Setting up the switch IP address
The switch IP address is set as follows:
switchname(config-if-fast)# ip-address-fast set <ip address> <network mask>
Also make sure that the broadcast address is configured properly s
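For example, to give the switch the illustrative address shown in the previous section:
switchname(config-if-fast)# ip-address-fast set 172.20.2.20 255.255.0.0
switchname(config-if-fast)# ip-address-fast show
Running ip-address-fast show afterwards confirms the new address, mask and broadcast address.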
197. p NETMASK pdsh w node n m cat etc sysconfig network scripts ifcfg ibO0 grep BROADCAST pdsh w node n m cat etc sysconfig network scripts ifcfg ib0 grep NETWORK pdsh w node n m cat etc sysconfig network scripts ifcfg ibO0 grep ONBOOT Reconfigure those settings where the values returned by these commands do not match what is required for the cluster E23 Starting the InfiniBand interfaces The following commands are used to load all the modules and to start all the InfiniBand interfaces on each node etc init d openibd start or service openibd start These commands have to be executed for each node individually Note node reboot may be used to load the InfiniBand modules automatically Configuring Interconnect Interfaces E 3 E 4 BASS for Xeon Installation and Configuration Guide Appendix F Binding Services to a Single Network The bind attribute in the etc xinetd conf file is used to bind a service to a specific IP address This may be useful when a machine has two or more network interfaces for example a backbone computer which is part of a cluster administration network and is at the same time connected to the customer LAN through a separate interface In this situation there may be backbone security concerns coupled with a desire to limit the service to the LAN For example to bind the ftp service to the LAN the etc xinetd conf file has to be configured as follows
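The exact stanza depends on how the ftp service is declared locally; a minimal sketch, assuming the LAN-facing interface of the node has the (invented) address 10.1.0.1, is:
service ftp
{
        ...                 # existing attributes of the service entry, unchanged
        bind = 10.1.0.1     # accept connections on the LAN interface only
}
Only the bind attribute (or its synonym, interface) is added; the other attributes of the real entry are left as they are.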
198. password admin when requested.
switch# delete flash:config.text
Answer the default questions (ENTER).
switch# reload
Confirm without saving (ENTER). Ignore the question "Would you like to enter the initial configuration dialog? [yes/no]" and disconnect.
Foundry Network Switches
Foundry Network switches must be configured with the IP address 192.168.1.200/24.
1 Erase the configuration. From a serial or Ethernet connection enter the following commands:
switch> enable
Enter the password admin when requested.
switch# erase startup-config
Answer the default questions (ENTER).
switch# reload
Confirm without saving (ENTER).
2 Configure the 192.168.1.200/24 IP address.
FLS648 Switch> enable
No password has been assigned yet.
Configuring Switches and Cards 9-7
9-8
FLS648 Switch# configure terminal
a. On FastIron FLS624 or FLS648 models:
FLS648 Switch(config)# ip address 192.168.1.200 255.255.255.0
FLS648 Switch(config)# end
FLS648 Switch# write mem
b. On BigIron RX4, RX8 and RX16 models:
RX Switch(config)# vlan 1
RX Switch(config-vlan-1)# router-interface ve 1
RX Switch(config-vlan-1)# interface ve 1
RX Switch(config-vif-1)# ip address 192.168.1.200 255.255.255.0
RX Switch(config-vif-1)# end
RX Switch# write mem
9.1.7 Basic Manual Configuration
Please use this method when configuri
199. quested Set the enable mode enable Enter the password when requested Display the configuration with the show configuration command An example is shown below show configuration Using 2407 out of 65536 bytes version 12 2 no service pad Service timestamps debug uptime Service timestamps log uptime no service password encryption hostname eswu0cl enable secret 5 1 1jvR vnD1S KOUDA4tNmIm zLTl1 no aaa new model ip subnet zero no file verify auto Spanning tree mode pvst spanning tree portfast default spanning tree extend system id vlan internal allocation policy ascending interface GigabitEthernet0 1 interface GigabitEthernet0 2 ree GigabitEthernet0 3 iue acd GigabitEthernet0 4 t interface GigabitEthernet0 5 1 GigabitEthernet0 6 1 intertsc GigabitEthernet0 7 I interface GigabitEthernet0 8 1 interface GigabitEthernet0 9 1 interface GigabitEthernet0 10 interface GigabitEthernet0 11 interface GigabitEthernet0 12 interface GigabitEthernet0 13 Configuring Switches and Cards 9 11 interface GigabitEthernet0 14 interface GigabitEthernet0 15 interface GigabitEthernet0 16 Interface GigabitEthernet0 17 interface GigabitEthernet0 18 1 interface GigabitEthernet0 19 1 interface GigabitEthernet0 20 faeries GigabitEthernet0 21 interface GigabitEthernet0 22 1 interface GigabitEthernet0 23 1 interface Gigabit
200. r backbone Automatic Installation andConfiguration of the Ethernet Switches The Ethernet switches can be configured automatically by running the command swtAdmin auto All the steps 1 6 below in the Ethernet Switch Configuration Procedure are executed in order with no user interaction If the automatic installation fails at any stage you will only need to execute the steps which remain including the one that failed BASS for Xeon Installation and Configuration Guide Alternatively the switches can be installed and configured interactively by using the command below swtAdmin step by step switch number number of new switches All the installation and configuration steps 1 6 are executed in order but the user is asked to continue after each one 9 1 4 Ethernet Switch Configuration Procedure 1 Generating Configuration Files There are two kinds of configuration files 1 files for the temporary configuration of the network and DHCPD services on the Service Node and 2 configuration files for the switches The switch configuration files are generated by running the command swtAdmin generate dbname database name netaddress network ip for temporary config netmask netmask for temporary configuration logfile lt logfile name network lt admin backbone gt verbose help While this command is being carried out the following message will app
201. r cut pasted into a text editor if the configuration details need to be modified Whether generated manually or by the configurator html tool the slurm conf file must contain the following information 1 The name of the machine where the SLURM control functions will run This will be the Management Node and will be set as shown in the example below ControlMachine lt basename gt 0 3 34 5 for Xeon Installation and Configuration Guide Cont rolAddr lt basename gt 0 The SlurmUser and the authentication method for the communications SlurmUser slurm AuthType auth munge as shown in the example file or AuthType auth none The type of switch or interconnect used for application communications SwitchType switch none used with Ethernet and InfiniBand Any port numbers paths for log information and SLURM state information If they do not already exist the path directories must be created on all of the nodes SlurmctldPort 6817 SlurmdPort 6818 SlurmctldLogFile var log slurm slurmctld log SlurmdLogFile var log slurm slurmd log h StateSaveLocation var log slurm log_slurmctld SlurmdSpoolDir var log slurm log_slurmd Provide scheduling resource requirements and process tracking details SelectType select linear SchedulerType sched builtin default is sched builtin ProctrackType proctrack pgid Provide accounting requirements The path directories must be created on all of the nodes if they do
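Taken together, and using xena as an example basename, the settings from items 1 to 6 above would appear in slurm.conf roughly as follows (auth/none may be substituted for auth/munge, as noted above):
ControlMachine=xena0
ControlAddr=xena0
SlurmUser=slurm
AuthType=auth/munge
SwitchType=switch/none
SlurmctldPort=6817
SlurmdPort=6818
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdLogFile=/var/log/slurm/slurmd.log
StateSaveLocation=/var/log/slurm/log_slurmctld
SlurmdSpoolDir=/var/log/slurm/log_slurmd
SelectType=select/linear
SchedulerType=sched/builtin
ProctrackType=proctrack/pgid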
202. r rebooting the system Release Notes Back amp Next Figure 3 16 Installation screen 10 Click on Next in Figure 3 15 to begin the installation of Red Hat Enterprise Linux Server 11 When the Congratulations the installation is complete screen appears carry out the procedure below to avoid problems later There may be problems with the graphic display the bottom part of the screen does not appear on some machines a Hold down the Ctrl Alt F2 keys to go to the shell prompt for console 2 b Save the xorg conf file by using the commands below cd mnt sysimage etc X11 Cp p xorg conf xorg conf orig c Edit the xorg conf file by using the command below vi mnt sysimage etc X11 xorg conf d Go to the Screen section subsection Display and after the Depth 24 line add the following line Modes 1024x768 832x624 e Save the file and exit vi f Confirm that the modifications have been registered by running the command diff xorg conf orig xorg conf 3 16 55 for Xeon Installation and Configuration Guide This will give output similar to that below gt Modes 1024x768 832x624 g Check the screen appearance is OK by holding down the Ctrl Alt F6 keys h Click on the Reboot button Note screen resolution be changed if there are any display problems by holding down Ctrl Alt or Ctrl Alt on the keyboard 3 1 8 First boot settings 12
203. r wish to preserve the existing data BASS for Xeon Installation and Configuration Guide 3 1 4 2 RED HAT ENTERPRISE LINUX 5 Installation requires partitioning of your hard drive By default a partitioning layout is chosen which is reasonable for most users You can either choose to use this or create linu paii O You have chosen to remove all Linux partitions and ALL DATA on them on the following drives Select the drive dev sda you sure you want to do this Review and modify partitioning layout Belease Notes Back Next Figure 3 7 Confirmation of the removal of any existing partitions Select Yes to confirm the removal of any existing partitions as shown in Figure 3 7 if this screen appears If the default partitioning is to be left in place go to section 3 1 5 Network access Configuration Reinstallation using the existing partition layout a RED HAT ENTERPRISE LINUX 5 Installation requires partitioning of your hard drive By default a partitioning layout is chosen which is reasonable for most users You can either choose to use this or create your own Remove linux partitions on selected drives and create default layout Select the drive s to use for this installation Advanced storage configuration Release Notes
204. [Screen capture of the SMC AOC-USAS-S8iR LP controller settings menu; it shows the controller options, for example the Enabled/Disabled controller state, but is only partially legible.]
Figure G-13 SMC AOC-USAS-S8iR Controller settings
12 Once all the settings are in place, press Escape to exit and select PHY Configuration from the Options menu (see Figure G-12).
13 Check the Physical Layer settings that are in place (see the Figure below). The settings below are examples only.
Figure G-14 SAS PHY Settings
Configuring AOC-USASLP-S8iR RAID Adapters for NovaScale R423 and R425 machines G-7
14 Press Escape several times until the Options menu appears, then select Disk Utilities as shown below.
[Screen capture of the SMC AOC-USAS-S8iR LP Controller Options menu with Disk Utilities highlighted.]
Figure G-15 RAID Configuration Utility Options Menu > Disk Utilities
15 Check that all the drives are present (see the Figure below).
[Screen capture of the SMC AOC-USAS-S8iR LP Controller Disk Utilities drive list; the screen prompts to select a drive and press Enter, and lists the model, revision and size of the drives detected in each slot (FUJITSU MAX3147RC disks among other SAS/SATA drives in the example shown).]
205. ration file In order that the automated commands run over ssh sudo the installer will have modified the default configuration file by commenting out this option Installing BASS for Xeon v1 2 Software on the HPC Nodes 3 49 Copy the etc hosts file onto the Reference Node Copy the etc hosts file from Management Node using the scp command with the IP address of the Management Node as the source parameter Example scp root lt Management_Node_IP_address gt etc hosts etc hosts 3 5 1 2 For a first installation of BASS for Xeon v1 2 On the COMPUTE COMPUTEX and combined LOGIN I O or dedicated LOGIN and I O Reference Nodes 1 Copy the root ssh directory from the Management Node on to the Reference Nodes scp r ssh lt reference_node gt 2 Test this configuration gt ssh lt reference_node gt unam The authenticity of host nsl 127 0 0 1 can t be established RSA key fingerprint is 91 7e 8b 84 18 90c 93 92 42 32 4a d2 9 38 e69 fc Are you sure you want to continue connecting yes no yes Warning Permanently added ns1 127 0 0 1 RSA to the list of known hosts Linux Note With this SSH configuration no password is required for root login from the Management Node to the other HPC nodes Copy the etc hosts file onto the Reference Node Copy the etc hosts file from Management Node using the scp command with the IP address of the Management Node as the source parameter Example
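In the same form as the example given earlier in this section, the command to run on the Reference Node is:
scp root@<Management_Node_IP_address>:/etc/hosts /etc/hosts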
206. rdb cfg 4 2 Commands config ipoib 3 64 E 1 preparenfs D 1 rhnreg ks 1 1 Compilers Fortran 7 1 7 2 installation 7 1 Intel 7 1 config ipoib command 3 64 E 1 configuration Ganglia 3 31 3 50 Lustre file system 6 5 network 3 17 NTP 3 32 overview 3 2 postfix 3 29 switches 9 1 Configuring FTP 8 13 Conman 1 14 D database dump 3 25 initialization 3 24 register storage information 4 1 ddn_admin conf file 4 8 ddn_init command 4 10 ddn_set_up_date_time cron file 4 8 debuggers Intel installation 7 2 disk partitioning 3 10 E Ethernet adapters E 2 F fcswregister command 4 18 FDA Storage Systems Configuring 4 3 GUI Client 4 3 iSMsvr conf file 4 4 Linux ssh access 4 5 Linux Systems 4 4 Storage Manager server 4 4 File Group csv 8 14 Fortran installation 7 1 7 2 fsck 1 14 fstab file 3 20 Index 1 1 G Ganglia configuration 3 31 3 50 gmetad conf file 3 31 gmond conf file 3 31 3 50 golden image creating 3 62 H HCA 400 Ex D Interface 8 1 InfiniBand 8 1 InfiniBand adapters E 2 InfiniBand interfaces Configuring E 1 Infiniband Networks 1 10 installation Ksis server 3 61 management Node 3 5 overview 3 2 Intel debugger installation 7 2 Intel libraries 7 1 Intel Trace Tool installation 7 2 intelruntime cc fc rpm 7 1 ISR 9024 Grid Switch 8 2 K Ksis server installation 3 61 L Linux rdesk
207. re changes to the WWPN then a new patch will have to be distributed on all the cluster nodes Another option is to copy the etc wwn file on the target nodes using the pdcp command pdcp w target nodes etc wwn etc Configuration of Storage Management 4 17 4 8 Enabling Brocade Fibre Channel Switches 4 8 1 Enabling Access from Management Node The ClusterDB is preloaded with configuration information for Brocade switches Refer to the fc switch table If this is not the case then the information must be entered by the administrator Each Brocade switch must be configured with the correct IP netmask gateway address switch name login and password in order to match the information in the ClusterDB Please refer to Chapter 9 for more information about the switch configuration You can also refer to Brocade s documentation 4 8 2 Updating the ClusterDB When the Brocade switches have been initialized they must be registered in the ClusterDB by running the following command from the Management Node for each switch fcswregister n fibrechannel switch name 4 18 55 for Xeon Installation and Configuration Guide Chapter 5 Configuring O Resources for the Cluster The configuration of O resources for the cluster consists of two phases Phase 1 The configuration of the storage systems e Definition of the data volumes LUNs with an acceptable fault tolerance level RAID e Configuration of the data a
208. ript has been used Completing the Configuration of SLURM on the Management Node Manually These manual steps must be carried out before SLURM is started on any of the cluster nodes The files and directories used by SLURMCTLD must be readable or writable by the user SlurmUser the SLURM configuration files must be readable the log file directory and state save directory must be writable Create a SlurmUser The SlurmUser must be created before SLURM is started The SlurmUser will be referenced by the slurmctld daemon Create a SlurmUser on COMPUTE X Login IO or LOGIN Reference nodes with the same uid gid 105 for instance groupadd g 105 slurm useradd u 105 g slurm slurm mkdir p var log slurm chmod 755 var log slurm The gid and uid numbers do not have to match the one indicated above but they have to be the same on all the nodes in the cluster The user name in the example above is slurm another name can be used however it has to be the same on all the nodes in the cluster Configure the SLURM job credential keys as root Unique job credential keys for each job should be created using the openssl program These keys are used by the slurmctld daemon to construct a job credential which is sent to the srun command and then forwarded to slurmd to initiate job steps E b openssl must be used not ssh genkey to construct these keys 3 38 55 for Xeon Installation and Configuration Gu
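The following sketch shows one common way of generating such a key pair with openssl; the file names match the JobCredentialPrivateKey and JobCredentialPublicCertificate paths configured in slurm.conf earlier, and the key length is only an example:
openssl genrsa -out /etc/slurm/private.key 1024
openssl rsa -in /etc/slurm/private.key -pubout -out /etc/slurm/public.key
The private key should typically be readable by the SlurmUser only, and both key files must then be copied to all of the nodes.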
209. rom either the Management Node or the Login Node as the test user using a command similar to that below echo sleep 60 usr pbs bin qsub 4 Check the execution of the job using the qstat command as shown below qstat an 5 This will give output in format similar to that below nova0 Req d Req d Elap Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time 0 novaO user test workq STDIN 8424 1 1 R 0 00 nova8 0 6 Once job has finished check that no errors are listed in the output as in the example below 3 70 55 for Xeon Installation and Configuration Guide cat STDIN oO 7 If there are problems run the tracejob command so that the problem can be identified tracejob lt job_ID gt This will give output similar to that below where no errors are reported O n 07 17 20 07 17 20 07 17 20 07 17 20 hosts 07 17 20 07 17 20 07 17 20 substate 07 17 20 ova0 08 16 24 08 16 24 08 16 24 user test nova0 08 16 24 nova8 ncpu 08 16 24 08 16 24 08 16 25 42 08 16 25 S Considering job to run enqueuing into workq state 1 hop 1 Job Queued at request of bench nova0 name STDIN queue workq Job Run at request of Scheduler nova0 on owner Job Modified at request of Scheduler nova0 Job run Obit received momhop 1 serverhop 1 state 4 Exit status 0 resources used cpupercent 0 resources used cput 00 00 00 re
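For clarity, the submission command used in step 3, with its pipe written out, is equivalent to:
echo "sleep 60" | /usr/pbs/bin/qsub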
210. rs the cluster management services run on dedicated nodes The Login and O services also run on separate dedicated nodes Clusters which use the Lustre parallel file system will need at least two separate Service Nodes dedicated to it Service Nodes Compute Nodes Management Node InfiniBand ee Ethernet d Login Node s Interconnects 1 0 Node s Figure 1 3 Large Cluster Architecture 1 2 3 Service node s Bull NovaScale R423 2 socket Xeon servers R440 and R460 and can all be used for the Service Nodes for Bull BAS5 for Xeon v1 2 Clusters 1 4 BASS for Xeon Installation and Configuration Guide NovaScale R423 servers Figure 1 4 NovaScale R423 server NovaScale R423 servers are double socket dual or quad core machines that support SAS and SATA2 3 5 inch storage disks NovaScale R440 servers ees ee Figure 1 5 NovaScale R440 server NovaScale R440 servers are double socket dual core machines that support SATA 3 5 SAS 2 5 and SAS 3 5 storage disks NovaScale R460 servers Figure 1 6 NovaScale R460 server Cluster Configuration 1 5 1 2 3 1 1 2 3 2 1 2 3 3 NovaScale R460 servers are double socket dual core machines that support SAS and SATA2 storage disks Note From this point onwards the Service Node running the management services will be known as the Management Node For small clusters as explained this node may also include Login and I
211. rt internal NVIDIA Tesla C1060 accelerator cards 1 8 BASS for Xeon Installation and Configuration Guide Figure 1 11 NovaScale R425 machine Figure 1 12 NVIDIA Tesla C1060 internal graphic card NovaScale R480 E1 servers Bull NovaScale R480 E1 servers are quad socket dual or quad core machines Figure 1 13 NovaScale R480 E1 machine Cluster Configuration 1 9 1 2 5 Leos 1 2 5 2 1 2 6 1 2 6 1 1 10 Networks The cluster contains different networks dedicated to particular functions including e Administration Network e High speed interconnects consisting of switches and cable boards to transfer data between Compute Nodes and I O Nodes Administration Network The Administration network uses an Ethernet network for the management of the operating system middleware hardware switches fibre channel cabinets etc and applications from the Management Node Note An optional Ethernet link is necessary to connect the cluster s Login Node s to a LAN backbone that is external to the cluster This network connects all the LAN1 native ports and the BMC for the nodes using a 10 100 1000 Mb s network This network has no links to other networks and includes 10 100 1000 Mb s Ethernet switch es Backbone The Backbone is the link between the cluster and the external world This network links the Login Node to the external network through a LAN network via Ethernet switches For per
212. s Ethernet Gigabit Networks BASS for Xeon Ethernet Gigabit networks can use either CISCO or FOUNDRY switches as follows Cisco Switches Host Channel Adapter will use one of the two native ports for each node e Clusters with less than 288 ports will use Cisco Catalyst 3560 24 Ethernet 4 SFP ports 48 Ethernet 4 SFP ports switches e Clusters with more than 288 ports will use a hierarchical switch structure based on the node switches described above and with the addition of Cisco Catalyst 650x top switches x 3 6 9 13 which provide up to 528 ports Foundry Switches BASS for Xeon supports two Fastlron LS base model switches LS 624 and LS 648 and the BIGIRON RX 4 RX 8 and RX 16 layer 2 3 Ethernet switch rack e Fastlron LS 624 supports twenty four 10 100 1000 Mbps RJ 45 Ethernet ports Four ports are implemented as RJA5 SFP combination ports in which the port may be used as either a 10 100 1000 Mbps copper Ethernet port or as a 100 1000 Mbps fiber port when using an SFP transceiver in the corresponding SFP port The Fastlron LS 624 includes three 10 Gigabit Ethernet slots that are configurable with 10 Gigabit Ethernet single port pluggable modules Cluster Configuration 1 11 lev 1 12 e Fastlron LS 648 supports forty eight 10 100 1000 Mbps RJ 45 Ethernet ports Four of these ports are implemented as RJ45 SFP combination ports in which the port may be used as either a 10 100 1000 Mbps
213. s See STEP 3 in the Chapter 3 in this manual regarding the use of the SLURM configurator html file to generate the new slurm conf file 2210 SLURM User Scripts User scripts that previously invoked srun allocate attach and batch mode options in SLURM version 1 1 19 will have to be modified as these options have been removed and now exist separately as the salloc sattach and sbatch commands in SLURM version 1 3 2 See The What s New chapter in the Software Release Bulletin for BAS5 for Xeon V1 2 for details of the version 1 3 2 changes for SLURM 2 2 2 BASS for Xeon v1 1 Configuration files Syslog ng conf The BASS for Xeon v1 1 syslog ng conf file must be saved on an external back up device as this will be used later before BASS for Xeon v1 2 is installed The BASS for Xeon v1 1 syslog ng conf file will be overwritten when 5 for Xeon v1 2 is installed 23 Pre installation Operations for 55 for Xeon v1 2 XHPC Software 1 Create the directory for the BAS5 for Xeon v1 2 XHPC software on the Management Node mkdir p release XBAS5V1 2 2 Insert the BASS for Xeon v1 2 XHPC DVD ROM into the DVD reader and mount it Updating BAS5 for Xeon v1 1 clusters to BASS for Xeon v1 2 2 3 2 4 2 4 mount dev cdrom media cdrecorder 3 Copy the BASS for Xeon v1 2 XHPC DVD ROM contents into the release directory cp a media cdrecorder release XBAS5V1 2 4
214. s Ganglia a second open source tool to collect and display performance statistics for each cluster node graphically BASS for Xeon Installation and Configuration Guide 13 2 13 21 1 3 2 2 1 3 2 3 Program Execution Environment Resource Management Both Gigabit Ethernet and InfiniBand 5 for Xeon clusters can use the SLURM Simple Linux Utility for Resource Management open source highly scalable cluster management and job scheduling program SLURM allocates compute resources in terms of processing power and Compute Nodes to jobs for specified periods of time If required the resources may be allocated exclusively with priorities set for jobs SLURM is also used to launch and monitor jobs on sets of allocated nodes and will also resolve any resource conflicts between pending jobs SLURM helps to exploit the parallel processing capability of a cluster See The BASS for Xeon Administrator s Guide and User s Guide for more information on SLURM See Parallel processing and MPI libraries A common approach to parallel programming is to use a message passing library where a process uses library calls to exchange messages information with another process This message passing allows processes running on multiple processors to cooperate Simply stated a MPI Message Passing Interface provides a standard for writing message passing programs A MPI application is a set of autonomous processes each one running
215. s must be saved All the data must be saved onto a non formattable media outside of the cluster It is recommended to use the tar or cp a command which maintains file permissions 2 1 BASS for Xeon v1 1 Files 2 2 High Availability For clusters which include some form of High Availability this chapter must be used in conjunction with the BASS for Xeon High Availability Guide For example if your cluster includes High Availability for the Lustre file system refer to the chapter in the High Availability Guide which refers to the configuration of High Availability for Lustre as well as this chapter el The BAS5 for Xeon v1 1 haionfs conf file for NFS clusters is overwritten when upgrading to BAS5 for Xeon v1 2 The High Availability packages for NFS will need to be reinstalled and the haionfs conf file edited as described in Chapter 9 in the High Availability Guide Updating BASS for Xeon v1 1 clusters to BASS for Xeon v1 2 2 1 2 2 1 22 0 1 2 2 1 2 2 4 1 4 2 2 Optional for SLURM clusters only SLURM state files WARNING All jobs that are running should be saved and backed up before they are cancelled SLURM state files for version 1 3 2 are different from those for version 1 1 19 This means that it will not be possible to reuse previously saved job and node state information from version 1 1 19 Therefore all version 1 1 19 jobs must be cancelled cleanly before upgrading to version 1 3 2
216. s opt altair security altair_lic dat Would you like to start When the Installation complete window appears the installation program offers to start PBS Professional enter n for no Installing BASS for Xeon v1 2 Software on the HPC Nodes 3 41 3 3 13 4 See Initial configuration on the Management Node e Chapter 4 in the PBS Professional Installation and Upgrade Guide available on the PBS Professional CD ROM for more information on the installation for PBS Professional e Chapter 2 in the PBS Professional Administrator s Guide for more information on configuring PBS Professional 2 See the 5 for Xeon High Availability Guide for the pbs conf file configuration details if High Availability is in place for PBS Professional 3 3 13 5 2 9 14 3 3 15 1 Modify the etc pbs conf file as follows PBS_EXEC usr pbs PBS_HOME var spool PBS PBS_START_SERVER 1 PBS_START_MOM 0 PBS_START_SCHED 1 PBS_SERVER basename0 PBS_SCP usr bin scp Starting PBS Professional Run the PBS start script using a command with the following format lt path to script gt pbs start for example etc init d pbs start Installing Intel Compilers and Math Kernel Library Install the Intel Compilers and Math Kernel Library if required on a Service Node if it includes BOTH the Management services and the Login services Intel MKL is included with the
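For reference, the /etc/pbs.conf edits described in section 3.3.13.4 above correspond to the following file contents (<basename>0 stands for the Management Node name):
PBS_EXEC=/usr/pbs
PBS_HOME=/var/spool/PBS
PBS_START_SERVER=1
PBS_START_MOM=0
PBS_START_SCHED=1
PBS_SERVER=<basename>0
PBS_SCP=/usr/bin/scp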
217. s on the initial configuration A Windows laptop and a RS232 cable will be required The initialization parameters are saved in the cluster database da ethernet port table and can be retrieved as follows 1 Run the command below to see the EMC Clariion storage system information defined in the cluster management database storstat a grep DGC This command will list the DGC disk arrays to be configured on the cluster 2 For each DGC storage system retrieve the IP addressing information by using the command below storstat d n dgc name i H 3 For each Service Processor SPA and SPB of each CX3 40f set the IP configuration parameters for the address Hostname for SPA dgc name O for SPB dgc name 1 Subnet Mask Gateway Peer IP address IP address of the other SP of the same DGC disk array Once these settings have been made the Service Processor will reboot and its IP interface will be available 4 The Java and Firefox plugins are installed and linked by default so that the http interface for the EMC Navisphere Management Suite can be used for the complementary configuration tasks Start the Firefox browser by running the command usr bin firefox Bull version However if there is a problem follow the procedure defined below to install these plugins a Install the following 2 RPMs from the BONUS directory on the Bull XHPC DVD XHPC BONUS ire version
218. section below that corresponds to your installation and follow the instructions carefully e First Installation Initialize the Cluster Database e Reinstallation of BASS for Xeon v1 2 with ClusterDB Preservation First Installation Initialize the Cluster Database Note This paragraph applies only when performing the first installation of BASS for Xeon v1 2 and the cluster has been delivered with no Cluster DB preloaded by Bull Contact Bull Technical Support to obtain the Cluster DB preload file 1 Run the following commands the IP addresses and netmasks below have to be modified according to your system Su postgres cd usr lib clustmngt clusterdb install loadClusterdb basename clustername adnw xxx xxx 0 0 255 255 0 0 bknw xxx xxx 0 0 255 255 0 0 bkgw ip gateway bkdom domain name icnw xxx xxx 0 0 255 255 0 0 preload load file Where basename mandatory designates both the node base name the cluster name and the virtual node name adnw mandatory is administrative network bknw option is backbone network bkgw option is backbone gateway bkdom option is backbone domain icnw option is ip over interconnect network BASS for Xeon Installation and Configuration Guide Note See the loadClusterdb man page and the preload file for details of the options which apply to your system 3 2 5 2 Preload sample files are available in usr lib clustmngt c
219. setup.sh and is used to automate and customize the installation process. The script reads the slurm.conf file created previously and does the following (a spot-check of the result is sketched at the end of this extract):

1. Creates the SlurmUser, using the SlurmUID, SlurmGroup, SlurmGID and SlurmHome optional parameter settings in the slurm.conf file to customize the user and group. It also propagates the identical SlurmUser and Group settings to the reference nodes.
2. Validates the pathnames for log files, accounting files, scripts and credential files. It then creates the appropriate directories and files, and sets the permissions. For user-supplied scripts, it validates the path and warns if the files do not exist. The directories and files are replicated on both the Management Node and the reference nodes.
3. Creates the job credential validation private and public keys on the Management and reference nodes.
4. If auth/munge is selected as the authorization type (AuthType) in the slurm.conf file, it validates the functioning of the munge daemon and copies the munge key file from the Management Node to the reference nodes.
5. Copies the slurm.conf file from the Management Node to the reference nodes.

Note: The script is also used to configure the LOGIN and COMPUTE(X) Reference Nodes, as described in STEP 5.

Note: Skip the next section, which describes how to complete the configuration of SLURM manually, if the slurm_setup.sh sc
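After the script has run, the propagated configuration can be spot-checked with standard tools; this is a minimal sketch assuming the defaults named above (user slurm, /var/log/slurm, /etc/slurm/slurm.conf), with <reference_node> as a placeholder.

   ssh <reference_node> 'id slurm'                # SlurmUser and group exist with the expected uid/gid
   ssh <reference_node> 'ls -ld /var/log/slurm'   # daemon log directory created with the right permissions
   diff /etc/slurm/slurm.conf <(ssh <reference_node> cat /etc/slurm/slurm.conf)   # same slurm.conf on both sides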
220. sources_used.mem=2796kb resources_used.ncpus=1 resources_used.vmem=167888kb resources_used.walltime=00:01:00
   07/17/2008 16:25:31  S  dequeuing from workq, state 5

8. If errors are reported, then look at the STDIN.e0 output file for PBS Professional problems, and at the STDIN.o0 output file for other problems. See the PBS Professional Administrator's Guide for more information regarding PBS Professional problems.

Testing a job launched in parallel

1. Give the test job a name; in the example that follows this is HelloWorld.
2. Execute the cat run.pbs command:
   cat run.pbs
3. This will give output similar to that below:
   #!/bin/bash
   #PBS -l select=2:ncpus=4:mpiprocs=4
   #PBS -l place=scatter
   #PBS -N HelloWorld
   source /opt/intel/fce/<version>/bin/ifortvars.sh
   source /opt/intel/cce/<version>/bin/iccvars.sh
   source /opt/mpi/mpibull2-<version>/share/setenv_mpibull2.sh
   mpibull2-devices -d ibmr_gen2
   mpirun -n 8 ./HelloWorld
4. Check that the test job was launched successfully across all the CPUs requested, as in the example above.
5. If errors are reported, then look at the run.e<job_ID> output file for PBS Professional problems, and at the run.o<job_ID> output file for other problems. See the PBS Professional Administrator's Guide for more information regarding PBS Professional
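To actually submit the test job and watch it run, the standard PBS Professional client commands can be used; the job ID shown is illustrative, and the output file name follows the run.o<job_ID> convention mentioned above.

   qsub run.pbs          # prints a job ID, for example 12.basename0
   qstat -a              # the HelloWorld job should appear, run, then complete
   cat run.o<job_ID>     # program output, typically one "Hello world" line per MPI rank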
221. stgres
(The dbmConfig output continues with NOTICE "Begin synchro" / "End synchro" messages for conman, snmptt, nagios and nsm, each completing with OK; the "update by Nagios" and "Resetting host status in DB" steps may take a few minutes.)

   su - postgres

4. Save the ClusterDB:
   pg_dump -Fp -C -f /var/lib/pgsql/backups/<name_of_clusterdball_sav>.dmp clusterdb

5. Go back to root by running the exit command.

6. Reboot the Management Node:
   exit
   reboot

3.3.7 Configuring Ganglia

1. Copy the file /usr/share/doc/ganglia-gmond-3.0.5/templates/gmond.conf into /etc.
2. Edit the /etc/gmond.conf file:
   • In line 9, replace "deaf yes" with "deaf no".
   • In line 18, replace xxxxx with the basename of the cluster: name "xxxxx" (replace with your cluster name).
   • In line 24, replace x.x.x.x with the alias IP address of the Management Node: host x.x.x.x (replace with your administration node IP address).
3. Start the gmond service:
   service gmond start
   chkconfig --level 235 gmond on
4. Edit the /etc/gmetad.conf f
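If several clusters are installed, the same three gmond.conf edits can be scripted; the sed patterns below are an assumption about the exact strings used in Bull's template (check lines 9, 18 and 24 first), and the cluster name and IP address are illustrative.

   cp /usr/share/doc/ganglia-gmond-3.0.5/templates/gmond.conf /etc/gmond.conf
   sed -i '9s/deaf *= *yes/deaf = no/'   /etc/gmond.conf
   sed -i '18s/xxxxx/mycluster/'         /etc/gmond.conf   # cluster basename
   sed -i '24s/x\.x\.x\.x/10.0.0.1/'     /etc/gmond.conf   # Management Node alias IP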
222. stormodelctl -m <model_name> -c applymodel -i <disk_array_name> -v

3. Check the status of the formatting operations on the storage systems.
   When the applymodel command has finished, the disk array proceeds to LUN formatting operations. Depending on the type of storage system, this operation can take a long time (several hours). The progress of the formatting phase can be checked periodically using the following command:
   stormodelctl -m <model_name> -c checkformat
   The message "no formatting operation" indicates that the formatting phase has finished and is OK.

WARNING: Ensure that all formatting operations are completed on all storage systems before using these systems for other operations.

4. Once the storage systems have been fully configured, reboot all the nodes that are connected to them, so that the storage systems and their resources can be detected.

Note: The LUN Access Control information (zoning) can be reconfigured using the stormodelctl -c applyzoning option once the configuration model has been deployed. The LUN configuration and all other parameters are preserved.

Automatic Deployment of the Configuration of I/O Resources for the Nodes

Note: All the storage systems connected to the nodes must have been configured, their LUNs formatted, and the nodes rebooted before this phase is carried out.

1. Check that each no
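Since formatting can take hours, a simple polling loop saves re-typing the check; the 10-minute interval is arbitrary and the string match relies on the "no formatting operation" message quoted above, so adapt it to the real output if it differs.

   # Poll until the disk arrays of the model report that formatting is complete
   while ! stormodelctl -m <model_name> -c checkformat | grep -q 'no formatting operation'; do
       echo "$(date): formatting still in progress"
       sleep 600
   done
   echo "Formatting finished for model <model_name>"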
223. Only carry out this task during the first installation, or if new Ethernet switches have been added to the cluster. The Ethernet switches should be as initially set (factory settings). Install the Ethernet switches by running the command, as root:
   swtAdmin auto
See Chapter 9 Configuring Switches and Cards in this manual for more details.

Configuring Postfix

1. Edit the /etc/postfix/main.cf file.
2. Uncomment, create, or update the line that contains myhostname:
   myhostname = <adminnode>.<admindomain>
   You must specify a domain name. Example:
   myhostname = node0.cluster
3. This step ONLY applies to configurations which use CRM (Customer Relationship Management); for these configurations the Management Node is used as Mail Server, and this requires that Cyrus is configured. Uncomment the line:
   mailbox_transport = cyrus
4. Start the postfix service:
   service postfix start

Configuring Management Tools Using Database Information

1. Run the following commands and check to see if any errors are reported. These must be corrected before continuing:
   dbmCluster check --ipaddr
   dbmCluster check --rack
2. Configure the tools with the following command, as root:
   dbmConfig configure --restart --force

An output example for this command is below:
   Wed Jul 30 09:09:06 2008 NO
   Wed Jul 30 09:09:06 2008 NO
   Wed Jul 30 09:09:06
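For reference, the relevant /etc/postfix/main.cf lines after editing might look as follows; node0.cluster is the example hostname from the text, and the Cyrus line applies only to CRM configurations.

   # /etc/postfix/main.cf (excerpt)
   myhostname = node0.cluster
   # Only for configurations using CRM, where the Management Node acts as mail server:
   mailbox_transport = cyrus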
224. <admin|backbone> --logfile <logfile_name> --verbose --help

This will display a message similar to that below:
   Final install and restart dhcp service
   stop the dhcpd service
   Shutting down dhcpd:                    [ OK ]
   Installing switches
   installing eswu1c0 switch 192.168.101.5 (fake ip)
   installing eswu0c0 switch 192.168.101.4 (fake ip)
   installing eswu1c1 switch 192.168.101.3 (fake ip)
   installing eswu0c1 switch 192.168.101.2 (fake ip)
   installed eswu1c0 switch
   installed eswu0c0 switch
   installed eswu1c1 switch
   installed eswu0c1 switch
   switches installed
   dbmConfig configure --service sysdhcpd --force --nodeps --dbname clusterdb
   Tue Oct 16 12:48:33 2007 NOTICE: Begin synchro for sysdhcpd
   Shutting down dhcpd:                    [FAILED]
   Starting dhcpd:                         [ OK ]
   Tue Oct 16 12:48:34 2007 NOTICE: End synchro for sysdhcpd

6. Delete the temporary configuration files:
   swtAdmin clear

7. Saving the switches' configuration.
   Finally, when the switches have been installed, the configuration parameters will be stored locally in their memory and also sent by TFTP to the Management Node /tftpboot directory. This is carried out by running the command:
   swtAdmin save --dbname <database_name> --logfile <logfile_name> --verbose --help

This will display a message similar to that below:
   Save configuration of switches
   Saving switc
225. t has been installed, as shown in the examples below.

Examples:
   PATH=/usr/kerberos/bin:/opt/intel/fce/10.1.015/bin:/opt/intel/cce/10.1.013/bin:/opt/cuda/bin:/usr/local/bin:/bin:/usr/bin
   LD_LIBRARY_PATH=/usr/local/cuda/lib:/opt/intel/cce/10.1.013/lib:/opt/intel/mkl/9.0/lib/em64t:/opt/intel/fce/10.1.015/lib:/opt/cuda/lib

See:
• The BAS5 for Xeon User's Guide and System Release Bulletin for more information on the NVIDIA compilers and libraries.
• The NVIDIA CUDA (Compute Unified Device Architecture) Programming Guide, and the other documents in the /opt/cuda/doc directory, for more information.

3.5.12 Installing RAID Monitoring Software (optional)

3.5.12.1 Monitoring using the LSI MegaRAID 8408E Adapter

Note: This kind of adapter is only installed on NovaScale R440 and NovaScale R460 machines.

Install the MegaCli-xxxx.i386.rpm package, which is available on the Bull Extension Pack CD-ROM (below), delivered with the machines which use these adapters:
   Bull Extension Pack for NovaScale Universal Rack-Optimized & Tower Series with RHEL5.1
No further configuration is required for the NovaScale R440 and R460 machines once the MegaCli-xxxx.i386.rpm is installed.

3.5.12.2 Monitoring using the AOC-USASLP-S8iR Adapter

Note: This kind of adapter is installed on NovaScale R423 and NovaScale R425 machines only.

1. Install the StorMan-xxxx.x86_64.rpm pa
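A quick way to confirm that the shell really picks up these locations once the variables are set (a minimal check; the version strings and exact directories will differ per installation):

   which nvcc icc ifort                                   # should resolve under /opt/cuda/bin and /opt/intel/.../bin
   echo $LD_LIBRARY_PATH | tr ':' '\n' | grep -E 'cuda|intel|mkl'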
226. t xmit data threshold <threshold>
   Port xmit pkts threshold <threshold>   enable
   (further port counter thresholds, interval and enable settings follow)

8.5 FTP setup

The switch management software allows the administrator to upload or download files to or from the switch. For this to happen it is vital to have a working FTP setup. The FTP server is installed automatically on the Management Node, but is not active. Run the command below to start the service:
   service vsftpd start

8.5.1 FTP configuration menu

Enter the FTP configuration menu as follows:
   ssh enable@switchname
   enable@switchname's password: voltaire
   Welcome to Voltaire Switch switchname
   Connecting
   switchname(config)#
   switchname(config)# ftp
   switchname(config-ftp)#

8.5.2 Setting up FTP

Notes:
• The FTP server must have been configured on the Management Node.
• The username and password shown below are examples only. Use the username and password which apply to your cluster.

The following settings define the node 172.20.0.102 as the FTP server. The switch logs onto this server using Joe's account with the specified password (yummy).
   switchname(config-ftp)# server 172.20.0.102
   switchname(config-ftp)# username joe
   switchname(config-ftp)# password yummy

8.6 The Group menu

The group menu is use
227. teps are necessary to complete the configuration of SLURM.

1. Create a SlurmUser.
   The SlurmUser must be created before SLURM is started. The SlurmUser will be referenced by the slurmctld daemon. Create a SlurmUser on the COMPUTE(X), Login/IO or LOGIN Reference Nodes with the same uid and gid (105, for instance):
      groupadd -g 105 slurm
      useradd -u 105 -g slurm slurm
      mkdir -p /var/log/slurm
      chmod 755 /var/log/slurm
   The gid and uid numbers do not have to match the ones indicated above, but they have to be the same on all the nodes in the cluster. The user name in the example above is slurm; another name can be used, however it has to be the same on all the nodes in the cluster.

2. Copy the SLURM configuration file on to the reference nodes.
   Copy the following files from the Management Node to the COMPUTE(X) and combined LOGIN/IO or dedicated LOGIN Reference Nodes (a copy sketch is given at the end of this extract):
   • /etc/slurm/slurm.conf
   • the public key, using the same path as defined in the slurm.conf file
   • the private key, using the same path as defined in the slurm.conf file
   Note: The public key must be on the KSIS image deployed to ALL the COMPUTE/COMPUTEX Nodes, otherwise SLURM will not start.

3. Check the SLURM daemon directory.
   Check that the directory used by the SLURM daemon (typically /var/log/slurm) exists on the COMPUTE(X), combined LOGIN/IO or dedicated LOGIN Reference Nodes.

4. Chec
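As an illustration, the copy in step 2 could be done with scp from the Management Node; the key file names below are placeholders - use the paths actually set by the JobCredentialPublicCertificate and JobCredentialPrivateKey parameters in slurm.conf, and replace the node names with your reference nodes.

   for node in compute_ref login_ref; do
       scp /etc/slurm/slurm.conf  root@$node:/etc/slurm/slurm.conf
       scp /etc/slurm/public.cert root@$node:/etc/slurm/public.cert   # public key (placeholder path)
       scp /etc/slurm/private.key root@$node:/etc/slurm/private.key   # private key (placeholder path)
   done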
228. terdb/install directory. These files will have been provided by manufacturing and are named <Type_Rack_Xan>_final.

The format of a MAC address file is as follows:
   rack_level  level  slot  mac_addr_of_node  mac_addr_of_bmc  ip_addr_of_bmc  comment

For each MAC address file:

• Identify the rack label, from the rack table of the ClusterDB, which corresponds to the file. For example, <Type_Rack_Xan>_final might be SNRXA2_final, where the rack Type is SNR, the x_coord of the rack is A and the y_coord of the rack is 2. Execute the command below, as the postgres user, in order to retrieve the rack label:
   psql -c "select label from rack where x_coord='A' and y_coord='2'" clusterdb
   label

• Update the database with the node and hardware manager MAC addresses for the rack by running the command below, as the postgres user:
   /usr/lib/clustmngt/clusterdb/install/updateMacAdmin <Type_Rack_Xan>_final <rack_label>
   Example:
   /usr/lib/clustmngt/clusterdb/install/updateMacAdmin SNRXA2_final RACK1

• Configure the IP addresses for the BMCs of the rack by running the command below, as the root user:
   /usr/lib/clustmngt/BMC/bmcConfig --input <Type_Rack_Xan>_final
   Example:
   /usr/lib/clustmngt/BMC/bmcConfig --input SNRXA2_final

Configuring Equipment Manually

If a node has been installed
229. the backups directory on the Management Node.

9. Change to postgres and restore the cluster database data dump files by running the commands below:
   su - postgres
   cd backups
   pg_restore -Fc --disable-triggers -d clusterdb preclusterdbdata2041.dmp
   exit

B.1.3 Migrating Cluster DB Data from BAS5 for Xeon v1.1

The procedure is exactly the same as described in section B.1.2 for BAS5 for Xeon v1.2 clusters, EXCEPT that the BAS5 for Xeon v1.1 specific Cluster DB data RPM (clusterdb-data-ANYAT-1_20_4-1.b.x.Bull) must be installed on the BAS5 for Xeon v1.1 Management Node. This RPM is available on the BAS5 for Xeon v1.2 XHPC DVD-ROM, or from Bull Technical Support.

B.2 Saving and Reinstalling the Cluster DB data

Follow the procedure described below to save and to restore the cluster database data for BAS5 for Xeon v1.2 clusters.

B.2.1 Saving the Data files

1. Login as the root user on the Management Node.
2. Enter:
   su - postgres
3. Enter the following commands:
   cd /var/lib/pgsql/backups
   pg_dump -Fc -C -f /var/lib/pgsql/backups/<name_of_clusterdball.sav> clusterdb
   pg_dump -Fc -a -f /var/lib/pgsql/backups/<name_of_clusterdbdata.sav> clusterdb
   For example, <name_of_clusterdbdata.sav> might be clusterdbdata-2008-1105.sav.
4. Copy the two .sav files onto a non-formattable media outside of the cluster.

B.2.2 Reinstalling the Data files
230. the default period (11 pm) is acceptable for your environment.
   0 23 * * * root /usr/sbin/ddn_set_up_date_time -s all -f 1
This cron synchronizes the times for the DDN singlets daily.

Note: If the configuration does not include DDN storage systems, then the line above must be commented out.

4.3.3 Enabling Event Log Archiving

The syslog messages generated by each DDN singlet are stored in the /var/log/DDN directory, or in the /varha/log/DDN directory if the Management Node is configured for High Availability.

Note: The log settings (for example, the size of the logs) are configured by default. Should there be a need to change these sizes, edit the /etc/logrotate.d/syslog-ng file. See the logrotate man page for more details.

4.3.4 Enabling Management Access for Each DDN

1. List the storage systems as defined in the cluster management database:
   storstat -a | grep DDN
231. to restart the complete initialisation procedure After a power down or a reboot check the full configuration carefully Check that initialization is correct that the network access is setup and that there is no problem on the DDN systems ddn admin i ip name singlet 1 c getinfo o HW ddn admin i ip name singlet 2 c getinfo o HW Configuration of Storage Management 4 11 4 4 Enabling the Administration of an Optima 1250 Storage System b Mem This section only applies when installing for the first time Note High Availability solution does not apply for nodes which are connected to Optima 1250 Storage Bays See The Storeway Optima 1250 Quick Start Guide for more details on the installation and configuration Storeway Master is a web interface module embedded into the Optima 1250 controllers It allows an Optima 1250 storage system to be managed and monitored from a host running StoreWay Master locally using a web browser across the internet or an intranet There is no particular software which needs to be installed to manage an Optima 1250 storage system 4 4 1 Optima 1250 Storage System Management Prerequisites e Ifthe initial setup was not done by manufacturing a laptop should be available and connected to the Ethernet Port of the Optimal 250 storage system via an Ethernet cross cable e The SNMP and syslogd electronic licenses sent by e mail should be available
232.
top command, 4-3
load_storage.sh, 6-8
lsiocfg command, 4-17
lustre_upgrade_lustre_layout.sh script, C-2
lustre 1.4.11, C-1
Lustre 1.6.3, C-1
Lustre file system configuration, 6-5
Lustre MGS entity, C-2
Lustre Migration, 1
Lustre Post-configuration, C-2
lustre.cfg file, 6-9
lustre_investigate command, 6-12
M
mount points: cdrom, 3-20
MPI libraries: MPIBull2, 1-15
N
nec_admin command, 4
nec_admin.conf file, 3-4, 4-6
network: administration network, 1-10, 3-18; backbone, 1-10; configuration, 3-17
Network Time Protocol (NTP), 3-32
node: compute node, 1-7; login node, 1-6; Management Node, 1-6
NTP configuration, 3-32
O
openssl, 3-38
P
partitioning disk, 3-10
PCI slots selection, H-1
postfix configuration, 3-29
postfix main.cf file, 3-29
prod option, D-1
S
saving: ClusterDB, 3-3; Lustre file system, 3-4; ssh keys, 3-4; storage information, 3-4
SLURM and openssl, 3-38
SLURM and Security, 3-38
ssh: saving keys, 3-4
ssh-keygen, 4-5
storageadmin directory, 3-4
storcheck cron file, 4-2
storframework.conf file, 3-4
switch configuration, 9-1
syslog-ng: port usage, 3-31; service, 3-32
syslog-ng.conf file, 3-31
syslog-ng/DDN file, 4-8
system-config-network command, 3-17
T
Trace Tool (Intel) installation, 7-2
V
Voltaire device configuration, 9-16
Volt
233. (Screenshot of the NovaScale Master console, showing the administrator terminal.)

Figure 3-20 The NovaScale Master console

Click on the Map link (top left) to display all the elements that are being monitored.

(Screenshot of the NovaScale Master 5.3-0 Console in Mozilla Firefox, showing the Status Map, Alerts and HPC Tools panes and the Service details table: FC port, System status, Controller and Disk services with their status, last check time, duration and information columns.)

Figure 3-21 NovaScale Master Monitoring Window

Checking nsctrl

nsctrl is a command that allows administrators to issue commands to the node BMCs.
Usage:
   /usr/sbin/nsctrl <options> <action> <nodes>
The available actions are: reset, poweron, poweroff, poweroff_force, status, ping, T
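For example, the power state of a couple of compute nodes can be checked from the Management Node as follows; the node names are illustrative, and the exact node-list syntax accepted by the command should be checked against its help output.

   /usr/sbin/nsctrl status node1 node2    # query the BMC power state of each node
   /usr/sbin/nsctrl ping node1 node2      # check that the BMCs answer over the network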
234. tre_io_node_dba

3. Check that the file /etc/cron.d/lustre_check.cron exists on the Management Node and that it contains lines similar to the ones below:
   # lustre_check is called every 15 mn
   */15 * * * * root /usr/sbin/lustre_check >> /var/log/lustre_check.log 2>&1

6.3.2 Configuring I/O Resources for Lustre

Skip this phase when carrying out an update to BAS5 for Xeon v1.2, or if BAS5 for Xeon v1.2 is being reinstalled, as the Lustre configuration and data files will have been saved.

At this point of the installation the storage resources should have already been configured, either automatically or manually, and in accordance with the type of storage system.

See Chapter 5 Configuring I/O Resources for the Cluster in this manual for configuration details (manual and automatic) for each type of storage system.

6.3.2.1 Configuring I/O Resources for Lustre after Automatic Deployment of I/O Configuration

This phase must take place after executing the procedures described in the Automatic Configuration of a Storage System and Automatic Deployment of the configuration of I/O resources for the nodes sections in Chapter 5.

When carrying out an update to BAS5 for Xeon v1.2, or if BAS5 for Xeon v1.2 is being reinstalled, do not run the following two stormodelctl commands, as the Lustre configur
235. twork <admin|backbone> --dbname <database_name> --logfile <logfile_name> --verbose --help

While this command is being carried out, a message similar to the one below will appear:
   Discover new switches on the network
   clear netdisco database
   network discovering by netdisco application, starting from 192.168.101.5 ip
   WARNING: not all new switches has been discovered, retry
   netdisco discovered X new devices

4. Updating the MAC addresses in the eth_switch table
   When the topology has been discovered, it is compared with the database topology. If there are no conflicts, the corresponding MAC addresses of the switches are updated in the eth_switch table of the database. This is done by running the command:
   swtAdmin mac-update --dbname <database_name> --logfile <logfile_name> --verbose --help
   The following message will appear:
   Update MAC address in the eth_switch table
   Updating mac address values in clusterdb database

5. Restarting the Switches and Final Installation/Configuration
   At this step all the switches are restarted and their final configuration is implemented by TFTP, according to the parameters in the DHCP configuration file. The DHCP configuration file is regenerated and will now include the MAC addresses of the switches obtained during the previous step. This is carried out by running the command:
   swtAdmin install --dbname <database_name> --network
236. uide for more information about

3.6.1 Installing, Configuring and Verifying the Image Server

3.6.1.1 Installing the Ksis Server

The Ksis server software is installed on the Management Node from the XHPC CDROM. It uses NovaScale commands and the cluster management database.

3.6.1.2 Configuring the Ksis Server

Ksis only works if the cluster management database is correctly loaded with the data which describes the cluster, in particular with the data which describes the nodes and the administration network. The preload phase, which updates the database, must have been completed before using Ksis.

3.6.1.3 Verifying the Ksis Server

In order to deploy an image using Ksis, various conditions must have been met for the nodes concerned. If the previous installation steps have been completed successfully, then these conditions will be in place. These conditions are listed below:

1. Start the systemimager service by running the command:
   service systemimager start
2. Each node must be configured to boot from the network via the eth0 interface. If necessary, edit the BIOS menu and set the Ethernet interface as the primary boot device.
3. The access to the cluster management database should be checked by running the command:
   ksis list
   The result must be "no data found", or an image list with no error messages.
4. Check the state of the nodes by runn
237. ured, their LUNs formatted, and the nodes rebooted before this phase is carried out.

1. Check that each node is connected to the correct storage system.
   Check the connection of each DDN storage system using the following command:
      ddn_conchk -I <ddn_name> -f
   Note: This command can only be used if ConMan is in place for the DDN storage systems.
   Check that the LUNs are accessible for the storage systems connected to each node by using the command below:
      lsiocfg -dv

2. Create the aliases from the Management Node, without using a model file.
   An alias must be created for each LUN of a storage system connected to a node. If multipathing has been configured, ensure that all paths to all devices are in the alive state by using the lsiocfg -x command.
   If the node is NOT in a High Availability pair:
      From the Management Node, run the command:
         stordiskname -c -r <node_name>
      Then run the command:
         ssh root@<node_name> stormap -c
   If the node is in a High Availability pair (node1, node2):
      From the Management Node, run the command:
         stordiskname -r <node1_name>,<node2_name>
      Then run the commands:
         ssh root@<node1_name> stormap
         ssh root@<node2_name> stormap

3. Check the aliases created for the I/O resources.
   Use the following command, on each node, to check that the aliases have been created correctly:
      stormap
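Where several nodes are involved, the per-node checks can be wrapped in a small loop run from the Management Node; the node names are illustrative, and the stormap invocation should be the exact form given in steps 2 and 3 above.

   for node in node1 node2 node3; do
       echo "=== $node ==="
       ssh root@"$node" stormap        # repeat the per-node alias check described above
   done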
238. us
   /dev/loop0 on /mnt/srv/lustre/MGS type lustre (rw)
   mgs_0 is running

3. Check the consistency of the database:
   lustre_investigate check
   This command checks which storage devices in the lustre_ost and lustre_mdt tables can be used. A clean output means that the command has been successful.

See: The lustre_investigate man page, or the BAS5 for Xeon Administrator's Guide, for more details.

Run the command below to list the OSTs. There must be at least one OST with cfg_stat set to "available":
   lustre_ost_dba list

Run the command below to list the MDTs. There must be at least one MDT with cfg_stat set to "available":
   lustre_mdt_dba list

Lustre HA: Carry out the actions indicated in the Managing Lustre Failover Services on the I/O and Metadata Nodes (the lustre_migrate Tool) section, in the Configuring High Availability for Lustre chapter, in the BAS5 for Xeon High Availability Guide.

6.3.8 Configuring Lustre

1. Configure Lustre on the I/O nodes.
   Run the following command and answer "yes":
      lustre_util set_cfg
   An output similar to the following is displayed:
      lustre.cfg copied on <I/O nodes>
      snmpd enabled on <I/O nodes>
      ldap database enabled on <mgmt node>

2. Create the file system configuration.
   The /etc/lustre/models/fs1.lmf file is a default model file which comes with the Lustre RPMs. It implements
239. ustre directory.

3.0.8 Saving the SLURM Configuration

The /etc/slurm/slurm.conf file is used by the SLURM resource manager. It is strongly recommended that this file is saved from the Management Node onto a non-formattable media.

3.1 STEP 1: Installing Red Hat Enterprise Linux Software on the Management Node

This step describes how to install the Red Hat Enterprise Linux software on the Management Node(s). It includes the following sub-tasks:
1. RAID configuration (optional)
2. Installation of the Red Hat Enterprise Linux 5 Server software
3. First boot settings
4. Configuring the Network
5. Installing an external Storage System (small clusters only)

3.1.1 Configure Internal RAID discs for BAS5 for Xeon clusters (optional)

3.1.1.1 Configure RAID for AOC-USASLP-S8iR Adapters

This kind of adapter is installed on NovaScale R423 and NovaScale R425 machines only. Each machine has to be configured individually.

See Appendix G Configuring AOC-USASLP-S8iR RAID Adapters for NovaScale R423 and R425 machines in this manual for details on how to configure these adapters.

Configure RAID for the LSI 1064 chip

Note: This kind of adapter is installed on NovaScale R421 E1 machines only.

3.1.2 Red Hat Enterprise Linux 5 Installation

3.1.2.1 Initial Steps

Before starting the installation, read all the procedure details carefully.
240. utines for PBS Professional described below.

3.3.13.1 Downloading, Installing and Starting the FLEXlm License Server

Note: This step applies to the Management Node (standard installation), or to a node which is dedicated as the FLEXlm Server(s).

This section only applies to clusters which do NOT feature High Availability for the Management Node, NOR redundancy for PBS Pro.

See: The BAS5 for Xeon High Availability Guide, and the PBS Professional Administrator's Guide available on the PBS Professional CD-ROM, if High Availability for the Management Node and High Availability redundancy for PBS Pro are in place.

1. Copy all the tarballs and documentation from the PBS Professional CD-ROM on to the Management Node.
2. Uncompress and extract the files using the command below:
   tar xvzf altair_flexlm-<version>.<architecture>.tar
   For example:
   tar xvzf altair_flexlm-9.0.amd64_s8.tar
3. Run the command below to start the installation process:
   ./licsetup.sh
4. Respond to the questions as they appear, identifying the location where the licensing package will be installed (/opt is recommended). This location is known as <install_loc>.
5. Copy the license file provided by Bull technical support to the folder <install_loc>/altair/security/altair_lic.dat.
6. Run the following commands to start the FLEXlm license server:
241. cd <install_loc>/altair/security/
   ./altairlm.init.sh start

7. To install the license startup script, run the following command:
   <install_loc>/altair/security/install_altairlm.sh

3.3.13.2 Starting the installation of PBS Professional

The commands for the installation have to be carried out by the cluster Administrator, logged on as root.

1. Extract the package from the PBS Pro CD-ROM to the directory of choice on the Management Node, using a command similar to that below:
   cd /root/PBS
   tar xvzf PBSPro_9.2.0-RHEL5_x86_64.tar.gz
2. Go to the installation directory on the Management Node and run:
   cd PBSPro_9.2.0
3. Start the installation process:
   ./INSTALL

3.3.13.3 PBS Professional Installation Routine

During the PBS Professional installation routine the Administrator will be asked to identify the following:
• Execution directory - the directory into which the executable programs and libraries will be installed, for example /usr/pbs.
• Home directory - the directory into which the PBS Pro daemon configuration files and log files will be installed, for example /var/spool/PBS.
• PBS installation type - the installation type depends on the type of node that PBS Professional is being installed on. Management Node: type 1.
• Do you want to continue? - answer Yes.
• License file location - in the example above this i
242. witch(config-line)# login
   myswitch(config-line)# exit

10. Enable the telnet connections and set a password:
   myswitch(config)# line vty 0 15
   myswitch(config-line)# password admin
   myswitch(config-line)# login
   myswitch(config-line)# exit

11. Exit the configuration:
   myswitch(config)# exit

12. Save the configuration in RAM:
   myswitch# copy running-config startup-config

13. Update the switch boot file on the Management Node.
   Run the following commands from the Management Node console:
      touch /tftpboot/<switch_configure_file>
      chmod ugo+w /tftpboot/<switch_configure_file>
   Note: The switch configure file name must include the switch name followed by confg, for example myswitch-confg.

14. Save and exit the switch configuration from the switch prompt:
   myswitch# copy running tftp
   myswitch# exit
   Enter the information requested for the switch. For the tftp server, indicate the IP address of the Service Node, which is generally the tftp server.

15. Disconnect the CISCO Switch.
   Once the switch configuration has been saved and the Administrator has exited from the interface, it will then be possible to disconnect the serial line which connects the switch to the Linux Management Node.

16. You can check the configuration as follows. From the Management Node, run the following command:
   telnet 10.0.0.254
   Enter the password when re
243. witchname(config-if fast)# broadcast fast set <broadcast IP address>

8.2.8 Route setup

8.2.8.1 Route configuration menu

The route can be set from the following menu:
   ssh enable@switchname
   enable@switchname's password: voltaire
   Welcome to Voltaire Switch switchname
   Connecting
   switchname(config)#
   switchname(config)# route
   switchname(config-route)#

8.2.8.2 Setting up the route

Set the route as follows:
   switchname(config-route)# default-gw fast set <gateway ip address>
Check that the route is fine:
   switchname(config-route)# default-gw fast show
It is strongly advised to reboot the switch after modifying the route parameter.

8.2.9 Routing Algorithms

The following routing algorithms are possible: Balanced routing, Rearrangeable, or Up/down.
   switchname(config-sm)# sm-info algorithm set <algorithm>
• Balanced routing is good for CLOS topologies when using a pruned network.
• Up/down is the best routing algorithm on fully non-blocking networks.
• Rearrangeable routing may impact performance.

8.2.10 Subnet manager (SM) setup

8.2.10.1 Subnet Manager Configuration menu

Enter the subnet manager configuration menu as follows:
   ssh enable@switchname
   enable@switchname's password: voltaire
   Welcome to Voltaire Switch switchname
   Connecting
   switchname(config)#
   switchname(config)# sm
244. x, Kdump, SLURM and PBS Pro
• Install the compilers, only on Management Nodes which include the Login functionality
• Configure the User environment, and in particular MPI

If your cluster has been delivered with the ClusterDB preload in place, or if you have saved your cluster database from a previous installation, go to the section Configuring Management Tools Using Database Information.

Generate the SSH keys

1. Change to the root directory on the Management Node.
2. Enter the following commands:
   ssh-keygen -t rsa
   Accept the default choices and do not enter a pass phrase.
   cat .ssh/id_rsa.pub >> .ssh/authorized_keys
3. Test the configuration:
   ssh localhost uname
   The authenticity of host 'localhost (127.0.0.1)' can't be established.
   RSA key fingerprint is 91:7e:8b:84:18:9c:93:92:42:32:4a:02:9:38:e69:fc
   Are you sure you want to continue connecting (yes/no)? yes
   Warning: Permanently added 'localhost' (127.0.0.1) (RSA) to the list of known hosts.
   Linux
   Then enter:
   ssh <clustername>0 uname

3.3.2 Configuring Equipment

Only carry out this task during the very first installation.

The purpose of this part is to collect the MAC address for each node in the cluster, and to configure the hardware manager (often called the BMC) for these nodes.

Look for the MAC address files in the /usr/lib/clustmngt/clus
245. y on the Node:
   mount -t nfs <Management_Node_IP>:/release /release

2. Run the command below to install the SLURM RPMs (a package check is sketched at the end of this extract):
   yum install slurm pam_slurm slurm-munge slurm-auth-none slurm-devel

See: The Software Release Bulletin for BAS5 for Xeon v1.2 for details on how to install SLURM version 1.0.15. This version is required to ensure compatibility with the LSF Batch Manager.

After the SLURM RPMs have been installed, some steps remain before the configuration of SLURM is complete on the Reference Nodes. These steps can either be done using the slurm_setup.sh script (see section 3.5.4.3), OR manually (see section 3.5.4.4).

Installing Munge on the Reference Nodes (optional)

Munge is installed as follows on clusters which use this authentication type for the SLURM components.

1. Run the command below on the COMPUTE and I/O reference nodes:
   yum install munge munge-libs

2. Run the command below on the COMPUTEX and LOGIN reference nodes:
   yum install munge munge-libs munge-devel

Note: munge and munge-libs are installed by default as part of the standard SLURM installation, and are included in the commands above as a check.

Configuring SLURM on the Reference Nodes using the Setup Script Tool

The SLURM setup script is found in /etc/slurm/slurm_setup.sh and is used to automate and customize the installation process. See Section 3.3.1.1.4 for a description of
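Before going further, it can be worth confirming that the expected packages are actually present on the reference node; this is a simple check using standard rpm queries.

   rpm -q slurm pam_slurm slurm-munge slurm-auth-none slurm-devel munge munge-libs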
246. ystems:
   mkdir /release
   mkdir /home_nfs

2. Edit the /etc/fstab file and add the NFSv3 file systems:
   <nfs_server>:/release    /release    nfs   defaults   0 0
   <nfs_server>:/home_nfs   /home_nfs   nfs   defaults   0 0

3. Mount the NFS file systems:
   mount /release
   mount /home_nfs

6.3 Configuring the Lustre file system

For clusters which include High Availability for Lustre, this section should be read alongside the Configuring High Availability for Lustre chapter in the BAS5 for Xeon High Availability Guide. Lustre HA pointers are included throughout this section. These indicate when and where the additional configurations required for Lustre High Availability should be carried out.

This section describes how to:
• Initialize the information to manage the Lustre File System
• Configure the storage devices that the Lustre File System relies on
• Configure the Lustre file systems
• Register detailed information about each Lustre File System component in the Cluster DB

Important: These tasks must be performed after the deployment of the I/O Nodes. Unless specified, all the operations described in this section must be performed on the cluster Management Node, from the root account.

See: If there are problems setting up the Lustre File System, and for more information about the Lustre commands, refer to the BAS5 for Xeon Administrator's Guide.
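Once mounted, a quick check confirms that both NFS file systems are live before the Lustre configuration starts:

   mount -t nfs                 # both <nfs_server>:/release and <nfs_server>:/home_nfs should be listed
   df -h /release /home_nfs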