86A290EW00-HPC BAS5 for Xeon V1 - Support On Line

Contents

1. [Sample command output printed rotated in the source; the extracted text is not recoverable.]
2. System part number N8100-1243E R440 SAS BIOS 5846 Motherboard Jumper settings JSASRAID2 1-2 RAID disable BIOS setup section parameter value System Date Hard Disk Pre Delay Disabled Processor Settings Processor Retest No Execute Disable Bit Disabled Intel(R) Virtualization Tech Disabled Enhanced Intel SpeedStep(R) Tech Disabled Language Advanced Memory Configuration Memory Retest No Extended RAM Step Disabled Memory RAS Feature Interleave Sparing Disabled PCI Configuration Onboard Video Controller VGA Controller Enabled Onboard VGA Option ROM Scan Auto Onboard LAN LAN Controller Enabled LAN1 Option ROM Scan Enabled LAN2 Option ROM Scan Enabled PCI Slot 1B Option ROM Enabled PCI Slot 1C Option ROM Enabled Peripheral Configuration Serial port A Enabled Base I/O address 3F8 Interrupt IRQ 4 Serial port B Enabled Base I/O address 2F8 Interrupt IRQ 3 USB 2.0 Controller Enabled Parallel ATA Enabled Serial ATA Enabled SATA Controller Mode Option Compatible Advanced Chipset Control Multimedia Timer Intel(R) I/OAT Wake On LAN PME Enabled Wake On Ring Disabled Wake On RTC Alarm Disabled Boottime Diagnostic Screen Reset Configuration Data NumLock Memory Processor Error Boot Security Supervisor Password Is Clear User Password Is Clear Password on boot Disabled Fixed disk boot sector Normal Power Switch Inhibit Disabled Server Console Redirection BIOS Redirection Port Serial Port B Managing th
3. Boot Option Retry Disabled NovaScale R422 BIOS Settings motherboard X7DBT X7DGT R422 BIOS 1.3c BIOS setup section parameter value Main System Time System Date Serial ATA Enabled Native Mode Operation Serial ATA SATA Controller Mode Option Compatible Advanced Boot Features QuickBoot Mode QuietBoot Mode POST Errors ACPI Mode Yes Power Button Behaviour Instant Off Resume On Modem Ring Off Power Loss Control Last State Watch Dog Disabled Summary screen Memory Cache Cache System BIOS area Write Protect Cache Video BIOS area Write Protect Cache Base 0-512k Write Back Cache Base 512k-640k Write Back Cache Extended Memory Area Write Back Discrete MTRR Allocation Disabled PCI Configuration Onboard G LAN1 OPROM Configure Enabled Onboard G LAN2 OPROM Configure Disabled Default Primary Video Adapter Onboard Emulated IRQ Solution Disabled PCI-e I/O Performance Payload 256B PCI Parity Error Forwarding Disabled ROM Scan Ordering Onboard First Reset Configuration Data No SLOT1 PCI-Exp x8 Option ROM Scan Enabled Enable Master Enabled Latency Timer Default Large Disk Access Mode DOS Advanced Chipset Control SERR signal condition Single bit 4GB PCI Hole Granularity 256 MB Memory Branch Mode Interleave Branch 0 Rank Interleave 4:1 Branch 0 Rank Sparing Disabled Branch 1 Rank Interleave 4:1 Branch 1 Rank Spar
4. Memory Cache Cache System BIOS area Write Protect Cache Video BIOS area Write Protect Cache Base 0-512k Write Back Cache Base 512k-640k Write Back Cache Extended Memory Area Write Back Discrete MTRR Allocation Disabled PCI Configuration Onboard G LAN1 OPROM Configure Enabled Onboard G LAN2 OPROM Configure Option ROM Re Placement Disabled PCI Parity Error Forwarding Disabled PCI Fast Delayed Transaction Disabled Reset Configuration Data No Frequency for PCIX 1 2 Auto SLOT0 PCI U X8 Option ROM Scan Enabled Enable Master Enabled Latency Timer Default SLOT1 PCI-X 133MHz Option ROM Scan Enabled Enable Master Enabled Latency Timer Default SLOT2 PCI-X 133MHz Option ROM Scan Enabled Enable Master Enabled Latency Timer Default SLOT3 PCI-Exp x8 Option ROM Scan Enabled Enable Master Enabled Latency Timer Default SLOT4 PCI-Exp x4 Option ROM Scan Enabled Enable Master Enabled Latency Timer Default SLOT5 PCI-Exp x8 Option ROM Scan Enabled Enable Master Enabled Latency Timer Default SLOT6 PCI-Exp x8 Option ROM Scan Enabled Enable Master Enabled Latency Timer Default Large Disk Access Mode DOS Advanced Chipset Control SERR signal condition Single bit Clock Spectrum Feature Disabled Intel VT for Directed I/O Disabled 4GB PCI Hole Granularity 256 MB Memory Voltage Auto Memory Branch Mode In
5. Console Redirection Com Port Address Baud Rate Console Type Flow Control Console connection Direct Continue C.R. after POST Hardware Monitor Fan Speed Control Modes 1 Disable Full speed IPMI System Event Logging Enabled Clear System Event Log Disabled SYS Firmware Progress Disabled BIOS POST Errors Enabled BIOS POST Watchdog Disabled OS boot Watchdog Disabled Timer for loading OS min 10 Time out action No Action Security Supervisor Password Is Clear User Password Is Clear Password on boot Disabled Boot
   NovaScale R440 SATA BIOS Settings System part number N8100-1241E R440 SATA BIOS 5536 Motherboard Jumper settings JSASRAID2 1-2 RAID disable BIOS setup section parameter value System Date Hard Disk Pre Delay Disabled Primary IDE Master Type Auto 32 Bit I/O Enabled Processor Settings Processor Retest No Execute Disable Bit Disabled Intel(R) Virtualization Tech Disabled Enhanced Intel SpeedStep(R) Tech Disabled Language English (US) Advanced Memory Configuration Memory Retest No Extended RAM Step Disabled Memory RAS Feature Interleave Sparing Disabled PCI Configuration Onboard Video Controller VGA Controller Enabled Onboard VGA Option ROM Scan Auto Onboard LAN LAN Controller Enabled LAN1 Option ROM Scan Enabled LAN2 Option ROM Scan
6. # time to wait after poweroff for all powerswitches being OFF
   couplets_StopDelay 30
   # Following part is used to control the order to stop nodes groups
   # GROUP <nb simultaneous poweron> <time to wait> <period to wait> <time to wait after this GROUP>

   2.2.3 Managing hardware (nsctrl)
   The nsctrl command carries out various tasks related to hardware. This command must be run from the Management Node. The tasks can be performed on any type of node (Compute Node, I/O Node, etc.) except the Management Node.
   Usage: /usr/sbin/nsctrl [options] <action> <nodes>
   General Options:
   --debug  Debug mode (more than verbose)
   --dbname name  Specify database name
   --force, -f  Do not ask for confirmation or state checking
   --group, -g  Specify a group of nodes. You can use the dbmGroup show command to display the defined groups
   --help, -h  Display nsctrl help
   --interval, -i  Specify the number of nsm calls before waiting the period defined by the --time option
   --jobs, -j  Number of simul
7. For security and tracking purposes, and also to reduce the administration work that results from the size of the cluster, all the system logs are centralized on the Management Node. There are two ways to send system log information to the Management Node:
   • The logs are collected on each node using standard mechanisms for archival and log file rotation. Various utilities ensure compression, transfer and archival of these log files on the Management Node in asynchronous mode. A centralized operation is performed on the Management Node to extract and search events according to the required criteria, for example date, type and severity. This asynchronous process facilitates corrective actions for incidents that have occurred on the cluster.
   • Some events are reported to the Management Node immediately. Filters specify the type and severity level of the events that have to be transferred immediately. This synchronous process instantly gives the administrator a global view of system events.
   syslog-ng (Syslog New Generation) is the system log manager used on Bull HPC clusters to manage cluster system logs. It includes the following features:
   • The ability to filter messages based on content using regular expressions
   • Encoding and authentication of the network traffic
   • Forwarding logs using TCP and UDP protocols
   • Log compression
8. [Sample command output printed rotated in the source; the extracted text is not recoverable.]
   config
   This action manually creates the instruction sequence needed to configure the hostname mapping for a switch.
   Note: This option only applies to Voltaire sw
9. switchname(utilities)#
   Once in the utilities menu, check which firmware version is installed:
   switchname(utilities)# firmware_verify_anafa_II
   Scan Fabric
   Default fw_version is 00.08.06

   5.2 Configuring FTP for the firmware upgrade
   If the switch firmware requires an upgrade, the FTP options for the switch will need to be set. These may already be in place following the initial Installation and Configuration of the cluster. If not, they are put into place as follows.

   5.2.1 Installing the FTP Server
   To install the FTP server (vsftpd), proceed as follows:
   rpm -ivh <path_to_vsftpd-<version>.<arch>.rpm>
   By default, the vsftpd daemon will not allow root access to the FTP server. For security reasons, it is advised to create a dedicated user for this purpose. However, if you wish to enable root access to the FTP server, vsftpd can be configured to allow this as follows:
   1. Edit the /etc/vsftpd/ftpusers file and comment out the line that starts with root, as shown below:
      # Users that are not allowed to login via ftp
      #root
      bin
   2. Edit the /etc/vsftpd/user_list file and comment out the line that starts with root, as shown below:
      # vsftpd userlist
      # If userlist_deny=NO, only allow users in this file
      # If userlist_deny=YES (default), never allow users in this file, and
      # do not even prompt for a password
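The edit described in steps 1 and 2 can also be scripted. The sketch below is only an illustration and is not part of the original procedure: `comment_out_root` is a hypothetical helper name, and the path in the usage note assumes that the vsftpd files live under /etc/vsftpd/. It prefixes the line that consists only of `root` with `#` and keeps the original file with a `.bak` suffix.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not from the manual): comment out the "root" entry
# in a vsftpd user file so that vsftpd no longer rejects root logins.
comment_out_root() {
    local file="$1"
    [ -f "$file" ] || { echo "no such file: $file" >&2; return 1; }
    # Prefix a line consisting only of "root" with "#". The original file
    # is kept as "$file.bak". Lines already commented out are left alone.
    sed -i.bak 's/^root[[:space:]]*$/#&/' "$file"
}
```

Typical use would be `comment_out_root /etc/vsftpd/ftpusers`, followed by the same call on the second file. Creating a dedicated FTP user, as the text advises, remains the safer choice.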
10. [Sample command output printed rotated in the source; the extracted text is not recoverable.]
11. errors, warnings, informative events, and report them. Both the PM and the SM generate events and report them to the event notification mechanism. In addition, events may be generated in the fabric and sent to the SM by fabric elements. The SM reports those events as well. The event mechanism can perform the following actions for each event:
   a. Log the event in the event log.
   b. Issue a trap to the GUI session.
   c. If the event corresponds to an alarm, it is also sent to the current alarm mechanism.
   The GUI color coding is defined according to the severity of traps and events, as described below:
   • Critical (Major): the system or a system component fails to operate. Examples: invalid link; duplicate or conflicting ports or path.
   • Yellow, Warning (Minor): reflects a problem in the fabric that does not prevent its operation. A warning is asserted when an event exceeds a predefined threshold. Examples: broken link; illegal connections between two sLB ports.
   • Normal (Information): notification provided to the user of a normal operating state or a normal system event. Examples: complete subnet reconfiguration; Create/Delete Multicast group; applied routing scheme; Port State Change.

   3.2 Troubleshooting InfiniBand Stacks
   A suite of InfiniBand diagnostic tools is provided with the Bull Advanced Server. These tools depend on one another hierarchically, as shown in the diagram below. For
12. [Sample command output printed rotated in the source; the extracted text is not recoverable.]
13. 3KR0JT0T00007548GUXA TRANSPORT SPI
   sdd 2:0:0:0 8:48 DDN running 10000 30 /dev/ldn.ddn0.13 MODEL DDN S2A 8500 FWREV 5.20 SERIAL 02A820510D00 TRANSPORT FC WWPN 24:00:00:01:ff:03:02:a8 NAME unknown
   sde 2:0:0:1 8:64 DDN running 125000 30 /dev/ldn.ddn0.14 MODEL DDN S2A 8500 FWREV 5.20 SERIAL 02A820540E00 TRANSPORT FC WWPN 24:00:00:01:ff:03:02:a8 NAME unknown
   sdf 2:0:0:2 8:80 DDN running 10000 30 /dev/ldn.ddn0.15 MODEL DDN S2A 8500 FWREV 5.20 SERIAL 03E020570F00 TRANSPORT FC WWPN 24:00:00:01:ff:03:02:a8 NAME unknown
   sdg 2:0:0:3 8:96 DDN running 125000 30 /dev/ldn.ddn0.16 MODEL DDN S2A 8500 FWREV 5.20 SERIAL 03E0205A1000 TRANSPORT FC WWPN 24:00:00:01:ff:03:02:a8 NAME unknown

   Disk Usage and Partition Inventories
   These inventories give information about system and logical use of the devices. Such information is mostly used for system administration needs.

   2.4.5 Checking Device Power State (pingcheck)
   The pingcheck command checks the power state (on or off) of the specified devices.
   Usage: pingcheck [options] Type <device type> command devices
   Options:
   --dbname name  Specify database name
   --debug, -d  Debug mode (more than verbose)
   --help, -h  Display pingcheck help
   --interval, -i  Specify the number of nsm calls before waiting the period defined by the --time option
   --jobs, -j  Number of simultaneous nsm actions, for example with -j 5 you can run 5 si
14. Enabled PCI Slot 1B Option ROM Enabled PCI Slot 1C Option ROM Enabled Peripheral Configuration Serial port A Enabled Base I/O address 3F8 Interrupt IRQ 4 Serial port B Enabled Base I/O address 2F8 Interrupt IRQ 3 USB 2.0 Controller Enabled Parallel ATA Enabled Serial ATA Enabled SATA Controller Mode Option Compatible Advanced Chipset Control Multimedia Timer Intel(R) I/OAT Wake On LAN PME Enabled Wake On Ring Disabled Wake On RTC Alarm Disabled Boottime Diagnostic Screen Reset Configuration Data No NumLock On Memory Processor Error Boot Security Supervisor Password Is Clear User Password Is Clear Password on boot Disabled Fixed disk boot sector Normal Power Switch Inhibit Disabled Server Console Redirection BIOS Redirection Port Serial Port B ACPI Redirection Port Disabled Baud Rate Flow Control Terminal Type VT100 Remote Console Reset Enabled Assert NMI on PERR Enabled Assert NMI on SERR Enabled FRB 2 Policy Retry 3 Times Boot Monitoring Disabled Boot Monitoring Policy Retry 3 Times Thermal Sensor Enabled BMC IRQ IRQ 11 Post Error Pause Enabled AC LINK Last State Power On Delay Time 0 Platform Event Filtering Enabled Boot
   7.3.8 NovaScale R440 SAS BIOS Settings
15. [Sample command output printed rotated in the source; the extracted text is not recoverable.]
16. R 2 3 2
   BIOS Version: MT25
   MPT Version: MPTFW-01.15.20.00-IT
   FW Version: 1.02.00-0119
   WebBIOS Version: 1.01-24
   Ctrl-R Version: 1.02-007
   Pending Images In Flash: None
   Note: The following MegaRAID card details are also provided when the AdpAllinfo command runs: PCI slot info, Hardware Configuration, Settings and Capabilities for the card, Status, Limitations, Devices present, Virtual Drive and Physical Drive Operations supported by the card, Error Counters and Default Card Settings.
   2. Decompress and extract the firmware by running the command below:
      unzip lsi 5 1 1 0054_ SAS FW_Image_1 03 60 0255 zip
      Archive root lsi 5 1 1 0054_ SAS FW_Image_1 03 60 0255 zip
      inflating sasfw rom
      inflating 5 1 1 0054 SAS FW_Image_1 03 60 0255 txt
      extracting DOS_MegaCLI_1 01 24 zip
   3. Update the firmware with the MegaCLI tool, using the command below:
      /opt/MegaCli -adpfwflash -f sasfw.rom -a0
      Adapter 0: MegaRAID SAS 8408E
      Vendor ID: 0x1000, Device ID: 0x0411
      FW version on the controller: 1.02.00-0119
      FW version of the image file: 1.03.60-0255
      Flashing image to adapter
      Adapter 0: Flash Completed
   4. Reboot the server so that the new firmware is activated for the card.

   Chapter 7 Managing the BIOS on NovaScale R4xxx Machines
   This chapter describes how to update the BIOS on NovaScale R4XX m
17. [Sample command output printed rotated in the source; the extracted text is not recoverable.]
18. Timer Default Large Disk Access Mode DOS Advanced Chipset Control SERR signal condition Single bit Clock Spectrum Feature Disabled Intel VT for Directed I/O (VT-d) Disabled 4GB PCI Hole Granularity 256 MB Memory Voltage Auto Memory Branch Mode Interleave Branch 0 Rank Interleave 4:1 Branch 0 Rank Sparing Disabled Branch 1 Rank Interleave 4:1 Branch 1 Rank Sparing Disabled Enhanced x8 Detection Enabled Demand Scrub Enabled High Temp DRAM OP Disabled AMB Thermal Sensor Disabled Thermal Throttle Disabled Global Activation Throttle Disabled Force ITK Config Clocking Disabled Snoop Filter Enabled Crystal Beach Feature Enabled Route Port 80h cycles to LPC High Precision Event Timer No USB Function Enabled Legacy USB Support Enabled Advanced Processor Options Frequency Ratio Default Core Multi Processing Enabled Machine Checking Enabled Fast String operations Enabled Thermal Management 2 Enabled C1 C2 Enhanced Mode Disabled Execute Disable Bit Enabled Adjacent Cache Line Prefetch Enabled Hardware Prefetcher Enabled Set Max Ext CPUID = 3 Disabled Direct Cache Access Disabled Intel(R) Virtualization Technology Disabled Intel EIST support Disabled I/O Device Configuration KBC Clock Input 12MHz Serial port A Enabled Base I/O address Serial port A 3F8 Interrupt Serial port A IRQ 4 Serial port B Enabl
19. UP or DOWN, port physical state, and the link width in terms of transfer rate.
   -v  enable verbose mode, which includes all sysfs supported parameters for the port interface and port
   Syntax: ibstatus [-h] [devname[:port]]
   Examples:
   • To display status of all IB ports, enter: ibstatus
   • To display status of mthca1 ports, enter: ibstatus mthca1
   • To show status of specified ports, enter: ibstatus mthca1:1 mthca0:2
   Output example for a mthca dual port HCA:
   Infiniband device 'mthca0' port 1 status:
      default gid: fe80:0000:0000:0000:0008:f104:0397:7ca5
      base lid: 0x0
      sm lid: 0x0
      state: 1: DOWN
      phys state: 2: Polling
      rate: 2.5 Gb/sec (1X)
   Infiniband device 'mthca0' port 2 status:
      default gid: fe80:0000:0000:0000:0008:f104:0397:7ca6
      base lid: 0x2d
      sm lid: 0x3
      state: 4: ACTIVE
      phys state: 5: LinkUp
      rate: 10 Gb/sec (4X)

   2.4.1.2 ibstat Command
   ibstat works in a similar fashion to the ibstatus utility, but is implemented as a binary rather than a script, and is more useful than ibstatus because it provides more detailed information. It includes options to list Channel Adapters and/or Ports.
   Syntax: ibstat [-d(ebug) -l(ist_of_cas) -p(orts_list) -s(hort)] <ca_name> [portnum]
   ibstat command examples:
   • To display status of all IB ports, enter: ibstat
   • To display status of mthca1 ports, enter: ibstat mthca1
   • To show status of spec
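When many nodes have to be checked, the ibstatus output shown in the excerpt above can be filtered rather than read by eye. The following is a minimal sketch, assuming the output layout of the example (an "Infiniband device ... port N status:" header followed by a "state:" line); `ib_ports_not_active` is a hypothetical helper name, not a tool shipped with the distribution.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read ibstatus output on stdin and print
# "<device> <port> <state>" for every port whose logical state is not ACTIVE.
ib_ports_not_active() {
    awk '
        /^Infiniband device/ {
            dev = $3
            gsub(/[^A-Za-z0-9_]/, "", dev)   # drop the quotes around the name
            port = $5
        }
        # Match only the logical state line; "phys state:" has $1 == "phys".
        $1 == "state:" {
            if ($NF != "ACTIVE") print dev, port, $NF
        }
    '
}
```

It would typically be used as `ibstatus | ib_ports_not_active`; with the dual-port example above it reports port 1 of mthca0 as DOWN.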
20. XmtConstraintErrors: 0
   RcvConstraintErrors: 0
   LinkIntegrityErrors: 0
   ExcBufOverrunErrors: 0
   VL15Dropped: 0
   XmtBytes: 458424
   RcvBytes: 1908363
   XmtPkts: 6367
   RcvPkts: 41748

   3.2.3 ibnetdiscover and ibchecknet
   ibnetdiscover scans the topology of the subnet and converts the output into a human readable form. Global IDs, node types, port numbers, port Local IDs and NodeDescriptions are displayed. The full topology is displayed, including all nodes and links, with the option of highlighting those which are currently connected. The output may be printed to a topology file.
   Syntax: ibnetdiscover [options] <topology filename>
   Non standard flags:
   -l  List of connected nodes
   -H  List of connected HCAs
   -S  List of connected switches
   ibchecknet uses a topology file, which has been created by ibnetdiscover, to scan the network, validating the connectivity and reporting errors detected by the port counters. The command runs as follows:
   ibchecknet
   A sample output is displayed below:
   #warn: counter SymbolErrors = 65535 (threshold 10)
   #warn: counter LinkRecovers = 26 (threshold 10)
   #warn: counter LinkDowned = 16 (threshold 10)
   #warn: counter RcvErrors = 21 (threshold 10)
   #warn: counter RcvSwRelayErrors = 54810 (threshold 100)
   warn
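The warning lines in the ibchecknet sample above can be reduced to a short table of counter, value and threshold. This is a sketch under stated assumptions: it expects each warning to contain the words "warn counter <name> <value> ... threshold <n>" and ignores the `#`, `:`, `=` and parenthesis characters, so it accepts the lines with or without that punctuation. `ib_warn_counters` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read ibchecknet warnings on stdin and print
# "<counter> <value> <threshold>" for each "warn counter" line.
ib_warn_counters() {
    tr -d '#:=()' | awk '
        $1 == "warn" && $2 == "counter" {
            thr = ""
            # The threshold is the field that follows the word "threshold".
            for (i = 4; i <= NF; i++)
                if ($i == "threshold") thr = $(i + 1)
            print $3, $4, thr
        }
    '
}
```

For example, `ibchecknet 2>&1 | ib_warn_counters | sort -k2,2nr` would list the counters with the largest values first.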
21. and Acronyms: Lists the Acronyms used in the manual.

   Bibliography
   • Bull HPC BAS5 for Xeon Installation and Configuration Guide (86 A2 87EW)
   • Bull HPC BAS5 for Xeon Administrator's Guide (86 A2 88EW)
   • Bull HPC BAS5 for Xeon User's Guide (86 A2 89EW)
   • Bull HPC BAS5 for Xeon System Release Bulletin (86 A2 64EJ)
   • NovaScale Master Remote HW Management CLI Reference Manual (86 A2 88EM)
   • Bull Voltaire Switches Documentation CD (86 A2 79ET)
   • StoreWay Optima 1250 Quick Start Guide (86 A1 52EW)
   • StoreWay Optima 1250 Installation and User Guide (86 A1 53EW)
   • StoreWay Master User Guide (86 A2 38ET)
   • StoreWay Master Installation Guide (86 A2 37ET)
   For clusters which use the PBS Pro Batch Manager:
   • PBS Professional 9.0 Administrator's Guide (on PBS Pro CD-ROM)
   • PBS Professional 9.0 User's Guide (on PBS Pro CD-ROM)

   Highlighting
   • Commands entered by the user are in a frame in Courier font. Example: mkdir /var/lib/newdir
   • Commands, files, directories and other items whose names are predefined by the system are in Bold. Example: The /etc/sysconfig/dump file.
   • Text and messages displayed by the system to illustrate explanations are in Courier New font. Example: BIOS Intel
   • Text for values to be entered in by the user is in Courier New. Example: COM1
   • Italics identifies referenced publications, chapters, sections, figures, and tables.
   • < > identifies parameters to be supplied by the user. Example: <node_name
22. available from the Bull support site. Follow the instructions provided with the CD.

   4.4 Reconfiguring the BMC on R4xx machines
   The BMCs are configured in the factory before the machines are delivered. However, it may be necessary to reconfigure the BMC to set up a new IP address, or when the firmware is updated. Follow the steps below to do this:
   1. Install the update bmc fw rpm onto the machine.
   2. Configure the LAN and SOL access to the BMC with the default user name (administrator) and default password (administrator).
      For the local BMC of the machine, run the command:
      bmc_init_param -b <BMC IP address> -m <BMC net mask>
      For a remote BMC on a machine accessible through SSH, run the command:
      bmc_init_param -b <BMC IP address> -m <BMC net mask> -s <remote machine IP>

   Chapter 5 Updating the firmware for the InfiniBand switches
   Voltaire switches should be properly configured to ensure maximum performance. For example, Voltaire switch firmware version 00.08.06 (ASIC) does not utilise Double Data Rate transfer for those links which include Mellanox cards, and should be upgraded. The Voltaire switch firmware upgrade procedure is described below.

   5.1 Checking which Firmware Version is running
   Go to the utilities menu as follows:
   ssh enable@switchname
   enable@switchname's password: voltaire
   Welcome to Voltaire Switch switchname
   Connecting
23. done by copying all the configuration files from the Management Node to the I/O node in question, using the scp command as shown below:
   scp /etc/lustre/conf/<fs_name>.xml <io_node_name>:/etc/lustre/conf/<fs_name>.xml
   <fs_name> is the name of each file system that was included on the I/O node before the crash.

   lustre_util info
   This command provides detailed information about the current distribution of the OSTs/MDTs. The services and their status are displayed, along with information about the primary, secondary and active nodes.

   /tmp/log/lustre/lustre_HA-ddmm.log
   This file provides a trace of the commands issued by the nodes to update the LDAP and ClusterDB databases. This information should be compared with the actions performed by CS5.
   Note: In lustre_HA-ddmm.log, dd specifies the day and mm the month of the creation of the file.

   /var/log/lustre/HA-DBDaemon-yy-mm-dd.log
   This file provides a trace of any ClusterDB updates that result from the replication of LDAP. This could be useful if Lustre debug is activated at the same time.

   3.6.2 On the Nodes of an I/O Pair
   The following tools must be run from the I/O nodes.
   ioshowall
   This command allows the configuration to be checked. Look at the /etc/cluster/cluster.conf file for any problems if the following error is displayed:
   Cannot connect to <PAP address> or HWMANAGER
   Check if the node is an
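When several file systems were present on the I/O node, the scp step above has to be repeated once per file system. The loop below is a minimal sketch of that repetition, not a command from the manual: `copy_lustre_conf` and the `SCP` variable are hypothetical names, and the /etc/lustre/conf/<fs_name>.xml path assumes that the punctuation lost in the excerpt is the usual slash and dot. Setting `SCP=echo` prints the commands instead of running them.

```bash
#!/usr/bin/env bash
# Hypothetical helper: copy the configuration file of each listed Lustre
# file system from the Management Node to one I/O node.
SCP="${SCP:-scp}"   # assumption: override with SCP=echo for a dry run

copy_lustre_conf() {
    local io_node="$1"
    shift
    [ -n "$io_node" ] && [ "$#" -gt 0 ] || {
        echo "usage: copy_lustre_conf <io_node_name> <fs_name>..." >&2
        return 2
    }
    local fs
    for fs in "$@"; do
        # Same source and destination path, as in the scp example above.
        $SCP "/etc/lustre/conf/${fs}.xml" \
             "${io_node}:/etc/lustre/conf/${fs}.xml" || return 1
    done
}
```

A dry run such as `SCP=echo; copy_lustre_conf io_node1 fs1 fs2` shows the two copies that would be made; the node and file system names here are placeholders.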
24. example, ibchecknet is dependent on ibnetdiscover, ibchecknode, ibcheckport and ibcheckerrs.
   Figure 3-1. OpenIB Diagnostic Tools Software Stack (read-only programs and read-write programs)
   Use the following command to launch the diagnostic tools:
   openib-diags
   ibstatus, ibtracert and ibdoctor (a tool developed by Bull) are described in chapter 2, Day to Day Maintenance Operations. Some of the more useful troubleshooting tools are described below.

   3.2.1 smpquery
   Subnet Manager Query (smpquery) includes a subset of standard SMP query options, which may be used to bring up information in a human readable format for different parts of the network, including nodes, ports and switches. The basic syntax for the command is as follows:
   smpquery [options] <op> <dest_addr> [op_params]
   nodeinfo example
   An example of use of this command, including the Local ID and the port number, is below:
   smpquery nodeinfo 45 1
   The resulting output will be similar to that displayed below:
   BaseVers: 1
   ClassVers: 1
   NodeType: Channel Adapter
   NumPorts: 2
   SystemGuid: 0x0008f10403977ca7
   Guid: 0x0008f10403977ca4
   PortGuid: 0x0008f10403977ca6
   PartCap: 64
   DevId: 0x5a04
   Revision: 0x000000a1
   LocalPort: 2
   VendorId: 0x0008f1
   portinfo example
   An exam
25. file, yy specifies the year, mm the month and dd the day of the creation of the file.

   /var/log/syslog
   This file provides a trace of the events and activity of CS5 and Lustre.

   Recovering consistent state of HA system
   In some very specific cases, it may be necessary to reset the HA system to a state which ensures consistency across the pair nodes, without stopping the Lustre system.
   1. Disconnect the fs1 Lustre File System from the HA system:
      lustre_ldap unactive -f fs1
      Now no operation on the HA system is passed on to the Lustre File System.
   2. Run:
      storioha -c stop
      clustat
   3. Perform one of the following actions.
      To move a node from primary state to pair node state, run:
      lustre_migrate export -n <node_name>
      Or, to reset the switched node back to its primary state, run:
      lustre_migrate relocate -n <node_name>
   4. Reconnect the Lustre File System to the Lustre HA system:
      lustre_ldap active -f fs1
   5. Run:
      storioha -c start

   3.7 SLURM Troubleshooting
   3.7.1 SLURM does not start
   Check that all the RPMs have been installed on the Management Node by running the command below:
   rpm -qa | grep slurm
   The following RPMs should be listed:
   slurm-x.x.xx-x.Bull
   slurm-auth-none-x.x.xx-x.Bull
   pam_slurm-x.x-x.x.xx-x.Bull
   slurm-auth-munge-x.x.xx-x.Bull
   Note: The version
26. > This command executes an Operating System (OS) command. If the OS is not responding, it is possible to use:

    nsctrl poweroff_force <node_name>

Wait for the command to complete.

Check the node status by using:

    nsctrl status <node_name>

The node can now be examined, and any problems which may exist diagnosed and repaired.

1.1.2 Restarting a Node

To restart a node, enter the following command from the management node:

    nsctrl poweron <node_name>

Note: If, during the boot operation, the system detects an error (temperature or otherwise), the node will be prevented from rebooting.

Check the node status

Make sure that the node is functioning correctly, especially if you have restarted the node after a crash:

• Check the status of the services that must be started during the boot. The list of these services is in the /etc/rc.d file.
• Check the status of the processes that must be started by a cron command.
• The mail server, syslog-ng and ClusterDB must be working.
• Check any error messages that the mails and log files may contain.

Restart SLURM and the filesystems

If the previous checks are successful, reconfigure the node for SLURM and restart the filesystems.

1.2 Stopping/Restarting an Ethernet Switch

• Power off the Ethernet switch to stop it.
• Power on the Ethernet switch to start it.

If a
27. Warning: A Warning notice indicates an action that could cause damage to a program, device, system or data.

Caution: A Caution notice indicates the presence of a hazard that has the potential of causing moderate or minor personal injury.

Table of Contents

Preface

Chapter 1. Stopping/Starting Procedures
    1.1 Stopping/Restarting a Node
        1.1.1 Stopping a Node
        1.1.2 Restarting a Node
    1.2 Stopping/Restarting an Ethernet Switch
    1.3 Stopping/Restarting a Backbone Switch
    1.4 Stopping/Restarting the HPC Cluster
        1.4.1 Stopping the HPC Cluster
        1.4.2 Starting the HPC Cluster

Chapter 2. Day to Day Maintenance Operations
    2.1 Maintenance Tools Overview
    2.2 Maintenance Administration Tools
        2.2.1 Managing Consoles through Serial Connections (conman, ipmitool)
        2.2.2 Stopping/Starting the Cluster (nsclusterstop, nsclusterstart)
        2.2.3 Managing hardware (nsctrl)
        2.2.4 Remote Hardware Management CLI (NS Commands)
        2.2.5 Managing System Logs (syslog-ng)
        2.2.6 Upgrading Emulex HBA Firmware with lptools
28. inactive pair node if the following error appears, otherwise start the node again:

    service lustre_ha inactif

clustat
Displays a global status for Cluster Suite 4 from the HA cluster point of view.

Important: If there is a problem, the two pair nodes may not have the same view of the HA cluster state.

storioha -c status
This command checks that all the Cluster Suite 4 processes are running properly ("running" state).

Notes:
• This command is equivalent to the following one on the Management Node:
      stordepha -c <status> -i <node>
• This command is included in the global checking performed by the ioshowall command.

stormap
This command checks the state of the virtual links.

Note: This command is included in the global checking performed by the ioshowall command.

lctl dl
This command checks the current status of the OST/MDT services on the node. For example:

    1 UP lov fsl1_lov e0000047fcfff680 b02a458d-544e-974f-8c92-23313049885e 4
    2 UP osc OSC_nova9_ost_nova6 ddn0 11_MNT_clientelan e0000047fcfff680 b02a458d-544e-974f-8c92-23313049885e 4
    3 UP osc OSC_nova9_ost_nova10 ddn0 5_MNT_clientelan e0000047fcfff680 b02a458d-544e-974f-8c92-23313049885e 4
    4 UP osc OSC_nova9_ost_nova6 ddn0 3_MNT_clientelan e0000047fcfff680 b02a458d-544e-974f-8c92-23313049885e 4
    5 UP osc OSC_nova9_ost_nova10 ddn0 21_MN
29. is being carried out, a message similar to that below will appear:

    Looking for program createdb using /usr/bin/createdb
    Looking for program psql using /usr/bin/psql
    Creating database ibsdb Done
    Loading table definitions into database ibsdb Done

dbdelete

To delete an IBS database (ibsdb), use the dbdelete command. Only the postgres user is allowed to delete an empty database.

    postgres@admin$ ibs -a dbdelete

While the command is being carried out, a message similar to that below will appear:

    Looking for program dropdb using /usr/bin/dropdb
    Deleting database ibsdb Done

dbpopulate

Use the dbpopulate action to populate a new database. In the example below, data is supplied from the iswu0c0-0 managed switch, from the Management Node, and the hostnames and traffic counters are populated using the OFED tools.

    ibs -s iswu0c0-0 -a dbpopulate -vN

While the command is being carried out, a message similar to that below will appear:

    Connecting to switch iswu0c0-0 Done
    Sending request for file NetworkMap.xml Done
    Getting response header from switch iswu0c0-0 Done
    Downloading NetworkMap.xml
    Populating boards
    Populating switch chassis with boards
    Assigning ports to IB hosts
    Connecting ports
    Looking for program smpquery
    Updating hostnames using OFED smpquery
    Looking for program perfquery
    Updating port counters using OFED perfquery
    Assigning portcounters
    Updating s
30. is important to perform a check-up of the Lustre HA file system. This section describes the tools that allow you to make the required checks.

3.6.1 On the Management Node

The following tools must be run from the management node.

lustre_check
This command updates the lustre_io_nodes table in the ClusterDB. The lustre_io_nodes table provides information about the availability and the state of the I/O nodes and metadata nodes.

lustre_migrate nodestat
This command provides information about the node migrations carried out. It indicates which nodes are supposed to support the OST/MDT services. In the following example, the MDS are nova5 and nova9, and the I/O nodes are nova6 and nova10; nova5 and nova6 have been deactivated, so their services have migrated to their pair nodes, nova9 and nova10.

    lustre_migrate nodestat

    HA paired nodes status
    node name    node status    HA node name    HA node status
    nova5        MIGRATED       nova9           OK
    nova6        MIGRATED       nova10          OK

Note: This table is updated by the lustre_check command.

lustre_migrate hastat [-n <node_name>]
This command indicates how the Lustre failover services are dispatched after CS4 software has been activated. Each node has a view on the paired failover services: the failover service dedicated to the node and the failover service dedicated to its pair node. If the pair node has switched roles, the owner column of the command output will show that this node supports t
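The nodestat table lends itself to a scripted check that lists which nodes have migrated and where their services now run. The sketch below assumes the four-column layout of the example (node name, node status, HA node name, HA node status); the helper name `migrated_nodes` is hypothetical.

```python
def migrated_nodes(nodestat_output):
    """Return {node: pair_node} for every node whose status is MIGRATED."""
    migrated = {}
    for line in nodestat_output.splitlines():
        fields = line.split()
        # Data rows have exactly four columns; the title and the column
        # header line are skipped because their second word is not MIGRATED
        # or they have a different number of words.
        if len(fields) == 4 and fields[1] == "MIGRATED":
            migrated[fields[0]] = fields[2]
    return migrated


# Text in the layout of the lustre_migrate nodestat example.
sample = """\
HA paired nodes status
node name node status HA node name HA node status
nova5 MIGRATED nova9 OK
nova6 MIGRATED nova10 OK
"""

for node, pair in migrated_nodes(sample).items():
    print(node, "services are running on", pair)
```

Because the table is refreshed by lustre_check, run that command first so that the parsed result reflects the current state.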
31. numbers depend on the release and are indicated by the letter x above.

3.7.2 SLURM is not responding

1. Run the command scontrol ping to determine if the primary and backup controllers are responding.

2. If they respond, then there may be a network or configuration problem; see section 3.7.5, Networking and Configuration Problems.

3. If there is no response, log on to the machines to rule out any network problems.

4. Check to see if the slurmctld daemon is active by running the following command:

       ps -ef | grep slurmctld

   a. If slurmctld is not active, restart it as the root user using the following command:

          service slurm start

   b. Check the log file named by the SlurmctldLogFile parameter in the slurm.conf file for an indication of why it failed.

   c. If slurmctld is running but not responding (a very rare situation), then kill and restart it as the root user using the following commands:

          service slurm stop
          service slurm start

   d. If it hangs again, increase the verbosity of debug messages by increasing SlurmctldDebug in the slurm.conf file, and restart. Again, check the log file for an indication of why it failed.

5. If SLURM continues to fail without an indication of the failure mode, stop the service, add the controller option -c to the /etc/slurm/slurm.sh script, as shown below, and restart:

       service slurm stop
       SLURM_OPTIONS_CONTROLLER=-c
       service slurm start

Note: All ru
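One pitfall of the `ps -ef | grep slurmctld` check is that the grep process itself appears in the listing. The sketch below shows one way to test the captured `ps -ef` text for the daemon while ignoring the grep line; it assumes the usual eight-column `ps -ef` layout with the command in the last column, and the helper name `daemon_is_running` and the sample PIDs are hypothetical.

```python
def daemon_is_running(ps_output, daemon="slurmctld"):
    """Return True if a process whose executable is `daemon` is listed."""
    for line in ps_output.splitlines():
        # UID PID PPID C STIME TTY TIME CMD: split off the first seven
        # columns and keep the full command line as the eighth field.
        fields = line.split(None, 7)
        if len(fields) < 8:
            continue
        command = fields[7].split()
        # Compare the executable name only, without its directory.
        if command and command[0].rsplit("/", 1)[-1] == daemon:
            return True
    return False


sample = """\
UID        PID  PPID  C STIME TTY          TIME CMD
root      4321     1  0 10:02 ?        00:00:01 /usr/sbin/slurmctld
root      4400  4399  0 10:05 pts/0    00:00:00 grep slurmctld
"""

print(daemon_is_running(sample))
```

The same helper can be used for the compute node daemon by passing `daemon="slurmd"`.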
32. on each port on the host.

3 0 3 Using INTEL_LMD_DEBUG Environment Variable

Setting this environment variable will cause the application to produce diagnostic information for the product at every license checkout.

Daemon Startup Problems

Cannot find license file
Most products have a default location in their directory hierarchy, or use /opt/intel/licenses/server.lic. The environment variable INTEL_LICENSE_FILE names this directory. Startup may fail if these variables are set wrong or if the default location for the license is missing.

No such Feature exists
The most common reason for this is that the wrong license file, or an outdated copy of the file, is being used.

Retrying Socket Bind
This means the TCP port number is already in use. Almost always, this means an lmgrd.intel is already running and you have tried to start it twice. Sometimes it means that another program is using this TCP port number. The number is listed on the SERVER line in the license file as the last item. You can change the number and restart lmgrd.intel, but only do this if you do not already have an lmgrd.intel running for this license file.

INTEL cannot initialise INTEL FLEXlm version 7.2 lmgrd. Please correct problem and restart daemons
You may be starting lmgrd.intel from the wrong directory or with relative paths. Use the following lines in the start-up, and add a full root path to INTEL to the end of
33. the VENDOR line in the license file:

    cd <installation directory>
    `pwd`/lmgrd.intel -c `pwd`/server.lic -l `pwd`/lmgrd.intel.log

License manager cannot initialize Cannot find license file
You have started lmgrd.intel on a non-existent file. The recommended way to specify the file for lmgrd.intel to use (<license>) is:

    cd <installation directory>
    `pwd`/lmgrd.intel -c `pwd`/server.lic -l `pwd`/lmgrd.intel.log

Invalid license key (inconsistent encryption code for FEATURE)
This happens for three different reasons:

1. The license file has been typed in incorrectly, or the data have been altered by the end user. Cutting and pasting from email is a safe way to avoid typing errors. See Entering License File Data above.
2. The license is generated incorrectly. Your vendor will have to generate a new license if this is the case.
3. The license vendor has changed encryption seeds (rare).

MULTIPLE <vendor daemon name> servers running
There are two lmgrd and vendor daemons running for this license file. Only one process per vendor daemon per node is allowed to run. Sometimes this can happen because lmgrd was killed with a -9 signal, which should not be done. lmgrd was then not able to bring the vendor daemon process down, so it is still running, although not able to serve licenses. If lmgrd is killed with a -9, the vendor daemons must then also be killed with a -9 signal. In general, lmdown should be used. Ve
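The point of the `pwd`-based start-up lines is that every path handed to the license daemon is absolute. A start-up wrapper can enforce this before launching anything. The sketch below only builds the argument list; it assumes the file names lmgrd.intel, server.lic and lmgrd.intel.log used above, the standard FLEXlm `-c` (license file) and `-l` (log file) options, and a hypothetical installation directory `/opt/intel/flexlm`. The helper name `lmgrd_command` is also hypothetical.

```python
import posixpath


def lmgrd_command(install_dir, daemon="lmgrd.intel", license_file="server.lic"):
    """Build the lmgrd command line using absolute paths only."""
    if not posixpath.isabs(install_dir):
        # Relative paths are one cause of the "cannot initialise" error.
        raise ValueError("installation directory must be an absolute path")
    base = posixpath.normpath(install_dir)
    return [
        posixpath.join(base, daemon),
        "-c", posixpath.join(base, license_file),   # license file to serve
        "-l", posixpath.join(base, daemon + ".log"),  # daemon log file
    ]


# Hypothetical installation directory, for illustration only.
print(" ".join(lmgrd_command("/opt/intel/flexlm")))
```

The returned list can be passed to `subprocess.run` unchanged; building it separately makes it easy to log exactly what was started.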
34. the server is using a different copy of the license file than the application. They should be synchronized. This error will also report UNSUPPORTED in the debug log file.

Invalid Host
You may be attempting to run the application on a host not listed in the HOSTID field of your license. Use lmhostid to find the hostid number for the current host.

Cannot find license file (No such file or directory) Expected license file location: <path>
The application was not able to find a license file. It gives you the location(s) where it was looking for a license file. Check that the named file exists. To use a file at a different location, use the environment variable INTEL_LICENSE_FILE.

No such Feature exists
The license manager cannot find a FEATURE line in the license file.

Feature has expired
Your license has expired, or the system time may be set incorrectly. Run the date command to make sure the date is not later than the Expiration Date listed in the license file.

<FEATURE name>: Invalid (inconsistent) license key
The license key and data for the feature do not match. This usually happens when a license file has been altered. See Entering License File Data above.

System Bootup Problems

For reasons unknown, some bootup files (/etc/rc, /sbin/rc2.d, etc.) refuse to run lmgrd with the simple commands indicated above. Here are two workarounds:

1. Use nohup su u
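The "Feature has expired" check compares the system date with the expiration date in the license file. The sketch below illustrates that comparison. It assumes, and this is not stated in the guide, that the expiration date is written in the usual FLEXlm `dd-mmm-yyyy` form (for example `31-dec-2008`) or as the keyword `permanent`; the helper name `license_expired` is hypothetical.

```python
from datetime import date

MONTHS = ("jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec")


def license_expired(expiration, today):
    """Return True if `today` is later than the expiration date string."""
    if expiration.lower() == "permanent":
        return False
    day, month, year = expiration.split("-")
    expires = date(int(year), MONTHS.index(month.lower()) + 1, int(day))
    # The license is still valid on the expiration date itself.
    return today > expires


print(license_expired("31-dec-2008", date(2009, 1, 1)))
print(license_expired("31-dec-2008", date(2008, 12, 31)))
```

Passing `date.today()` as the second argument reproduces the manual check made with the date command; if the result is unexpected, verify the system clock before requesting a new license.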
35. update the port counters for an existing IBSDB database. Use the command below:

    ibs -a dbupdatepc -vNE

availability

Use the availability action to see which ports and links are available for the InfiniBand interconnects. This action will not work unless the IBSDB database has been created and populated.

    ibs -s iswu0c0-0 -a availability

This will give results in a format similar to that below:

    Active ports 74
    Active uplinks 16
    Active downlinks 21

Return Values

IBS returns 0 for success. Any other value indicates a failure.

2.4.3 Monitoring Voltaire Switches

Different options exist for monitoring and maintaining the performance of Voltaire switches. To begin with, enter the utilities menu as follows:

    user@host$ ssh enable@switchname
    enable@switchname's password: voltaire
    Welcome to Voltaire Switch switchname
    Connecting
    switchname# utilities
    switchname(utilities)#

2.4.3.1 Resetting the counters

The counters (volume and errors) can be reset through the zero counters command, as follows:

    switchname(utilities)# zero counters
    zero All Counters
    Zero lid 8 port 255 mask 0xffff

2.4.3.2 Finding bad ports

The find_bad_ports command can be used to detect faulty ports:

    switchname(utilities)# find_bad_ports
    Found bad link port
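The three availability counts can be extracted for monitoring, for example to raise an alert when the number of active uplinks drops. The sketch below parses text in the format of the availability example; the optional colon in the pattern is an assumption to tolerate either layout, and the helper name `parse_availability` is hypothetical.

```python
import re


def parse_availability(text):
    """Return {'ports': n, 'uplinks': n, 'downlinks': n} from the output."""
    counts = {}
    pattern = r"Active\s+(ports|uplinks|downlinks)\s*:?\s*(\d+)"
    for kind, value in re.findall(pattern, text):
        counts[kind] = int(value)
    return counts


# Figures taken from the availability example above.
sample = "Active ports 74\nActive uplinks 16\nActive downlinks 21\n"
print(parse_availability(sample))
```

Since IBS returns 0 for success, a calling script should check the exit status first and parse the text only when the command succeeded.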
36. use_time_recv(no);       # Local time will be used instead of the time written in the logs
    gc_idle_threshold(100);  # The garbage collector is started after 100 events if syslog-ng is inactive
    gc_busy_threshold(3000); # The garbage collector is started after 3000 events if syslog-ng is active
    };

source Section

The source section defines the log source from the following: network, local files, peripheral, pipe, stream.

Syntax:

    source <identifier> { source-driver(params); source-driver(params); ... };

For example, the following lines are suitable for a Linux system. They enable the /dev/log stream to be read, syslog-ng internal messages to be received and kernel messages to be handled.

    source src { unix-stream("/dev/log"); internal(); file("/proc/kmsg"); };

Possible sources are as follows:

    unix-stream(<filename>)                     Stream pipes (used in Linux).
    file(<filename>)                            File data (Linux kernel messages, for example).
    pipe(<filename>)                            Named pipes (for interfacing with Nagios, for example).
    tcp(<ip>, <port>) and udp(<ip>, <port>)     To listen on an address and a port.
    internal()                                  syslog-ng internal messages.

destination Section

This section defines the destination of the logs.

Syntax:

    destination <identifier> { destination-driver(params); destination-driver(params); ... };

The possible destinations are the foll
37. with the l flag, the specified file will be used as the input file.

IBS command actions

topo

The topo action for the -a option provides detailed topology details for the switch.

    ibs -s <switch_name> -a topo -NE

This will give output that includes a description of the switches, the hostnames, the GUID for the nodes, the LID for the nodes and the physical location of the switches. The port details, including any errors, are shown in the bottom half of the screen, both for local ports and for ports which are connected to remotely.

[Screen example: IBS topo action output]
39. [Figure 2-2. Example of IBS command bandwidth action output]
43. BAS5 for Xeon Maintenance Guide

REFERENCE 86 A2 90EW 00

HPC BAS5 for Xeon
Maintenance Guide

Hardware and Software
April 2008

BULL CEDOC
357 AVENUE PATTON
B.P. 20845
49008 ANGERS CEDEX 01
FRANCE

The following copyright notice protects this book under Copyright laws which prohibit such actions as, but not limited to, copying, distributing, modifying, and making derivative works.

Copyright Bull SAS 2008
Printed in France

Suggestions and criticisms concerning the form, content, and presentation of this book are invited. A form is provided at the end of this book for this purpose.

To order additional copies of this book or other Bull Technical Publications, you are invited to use the Ordering Form also provided at the end of this book.

Trademarks and Acknowledgements

We acknowledge the rights of the proprietors of the trademarks mentioned in this manual. All brand names and software and hardware product names are subject to trademark and/or patent protection. Quoting of brand and product names is for information purposes only and does not represent trademark misuse.

The information in this document is subject to change without notice. Bull will not be liable for errors contained herein, or for incidental or consequential damages in connection with the use of this material.

Preface

Intended Readers

This guide is intended for use by qualified personnel in charge of m
44. BIOS setup section / parameter: value

    Flow Control: None
    Console connection: Direct
    Continue C.R. after POST

Hardware Monitor
    CPU Temperature Threshold: 75°C
    Fan Speed Control Modes: 1) Disable (Full speed)

IPMI
    System Event Logging: Enabled
    Clear System Event Log: Disabled
    SYS Firmware Progress: Disabled
    BIOS POST Errors: Enabled
    BIOS POST Watchdog: Disabled
    OS boot Watchdog: Disabled
    Timer for loading OS (min): 10
    Time out action: No Action

Security
    Supervisor Password Is: Clear
    User Password Is: Clear
    Password on boot: Disabled

Boot

7.3.3 NovaScale R421 E1 BIOS Settings

motherboard: S54005F
R421 E1 BIOS EA

BIOS setup section / parameter: value

Main
    Quiet Boot
    Post Error Pause: Disabled
    System Time
    Serial ATA: Enabled

Advanced
    Processor Configuration
        Enhanced Intel SpeedStep: Enabled
        Core Multi Processing: Enabled
        Intel(R) Virtualization Technology: Disabled
        Intel VT for Directed I/O: Disabled
        Simulated MSI support: Disabled
        Execute Disable Bit: Disabled
        Hardware Prefetcher: Enabled
        Adjacent Cache Line Prefetch: Enabled
        IOAT2 enable: Enabled
        Processor Retest: Disabled
    Memory Configuration
        Memory RAS & performances
        Memory RAS configuration: RAS Disabled
        Snoop Filter: Enabled
        FSB High Bandwidth Optimisation: Enabled
    ATA Configuration
        Onboard PATA Controller: Enabled
        Onboard SATA Controller: Enabled
        SATA Mode: Enhanced
        AHCI Mode: Disabled
        Configure SATA as
46. Configuring syslog-ng

syslog-ng is installed on the cluster using the default configuration. The scripts used to transfer log files are also installed. Administrators can modify the default configuration according to their needs.

The /etc/syslog-ng/syslog-ng.conf file contains the configuration parameters for syslog-ng. This file is divided into five sections:

    options section        General options
    source section         Source events
    destination section    Log destinations
    filter section         Filter definitions
    log section            Actions to be performed on messages

options Section

Any general parameters may be configured in the options section. An example is below:

    # Start of options area
    options {
        sync(0);              # Number of events before writing in the logs
        time_reopen(10);      # Wait 10s before reconnecting if the connection failed.
                              # Used when logs are centralized through the network
        time_reap(number);    # Closes a log file that is not accessed after number seconds
        log_fifo_size(1000);  # Number of event lines stored before writing them.
                              # Enables events to be taken quickly into account and
                              # frees the process that has generated them
        long_hostnames(off);  # Usage of long names
        use_dns(no);          # Usage of DNS to find addresses
        use_fqdn(no);         # Usage of machine short name
        owner(root);          # Logs owner
        group(root);          # Logs group
        perm(644);            # Logs rights mask
        keep_hostname(yes);
        create_dirs(yes);     # Create directories for log storage
47. Fabric installation and during startup
• Before running an application
• Performance problems, by locating discarded packets and link integrity problems
• MPI job run problems, to locate malfunctioning nodes and get the overall fabric structure
• Additional problems related to fabric stability, blocking or other

3.1.3 Debugging Tools

Tools available to perform diagnostics:

• Use the Topology Map to see current problems
• The Error Log
• The Bad Ports Log
• The Current Alarms Table
• The Fabric Statistics (portcounters.csv file)

3.1.4 High-Level Diagnostic Tools

1. Enable the SM Fabric Inspect preferences for debugging Fabric Failure.
2. Use the VFM/VDM Port Counters Information and Graph window to check a specific port counter's health.
3. Use the Event Log to discover that there is a problem in the fabric. In the VFM, right-click and select View Event to get information to help identify where the problem is located. Alternatively, you can show the Event Log from the CLI.
4. Use the Current Alarms Table to see current problems. In the VFM, right-click and select Alarm Data to get information to help identify where the problem is located.
5. Use the Topology Map to identify nodes with a current alarm.
6. Proactively look for increasing error counters using the statistics feature, and run the Diagnostic scripts using the CLI.

Note: See the Voltaire Switch User Manual ISR 9024, ISR 9096 and ISR 9288/2012 Switche
49. ISODATE, YEAR, MONTH, DAY, HOUR, MIN, SEC, FULLHOST, HOST

Some examples are below:

    destination full {
        file("/dev/tty12");
        file("/var/log/full_$DAY-$MONTH-$YEAR.log" owner(root) group(adm) perm(0640));
    };

    destination hosts {
        file("/var/log/HOSTS/$HOST/$FACILITY/$YEAR/$MONTH/$DAY/$FACILITY$YEAR$MONTH$DAY"
             owner(root) group(adm) perm(0600) dir_perm(0700) create_dirs(yes));
    };

Note: Do not forget to remove or archive older files regularly.

filter Section

This section describes the filtering mechanism for events.

Syntax:

    filter <identifier> { expression; };

The filters are defined by the following keywords:

    facility(facility[,facility])     To filter by type.
    level(pri[,pri1..pri2[,pri3]])    To filter by priority or level.
    program(regexp)                   To filter by the name of the program that has generated the message.
    host(regexp)                      To filter by the regular expression of the name of the host that has sent the message.
    match(regexp)                     To filter by a regular expression.
    filter(filtername)                To use another filter.

All keywords may be used several times. The expressions can contain the AND, OR and NOT operators.

Examples:

    filter f_iptables { match("IN OUT MAC"); };
    filter f_snort { match("snort"); };
    filter f_full { not filter(f_snort) AND NOT filter(f_iptables); };
    filter f_messages { level(info..warn) AND
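To see which file a destination template will actually write to, the macros can be expanded by hand. The sketch below mimics that expansion for a few of the macros listed above (YEAR, MONTH, DAY, HOUR, MIN, SEC, HOST) plus FACILITY; it is an illustration rather than syslog-ng's own implementation, the zero-padded two-digit fields are an assumption, and the helper name `expand_macros`, the host `node1` and the facility `kern` are hypothetical.

```python
import re
from datetime import datetime


def expand_macros(template, when, host, facility):
    """Replace $MACRO names in a destination file name template."""
    values = {
        "YEAR": f"{when.year:04d}",
        "MONTH": f"{when.month:02d}",
        "DAY": f"{when.day:02d}",
        "HOUR": f"{when.hour:02d}",
        "MIN": f"{when.minute:02d}",
        "SEC": f"{when.second:02d}",
        "HOST": host,
        "FACILITY": facility,
    }
    # A macro name is a run of capital letters after "$"; names that are
    # not in the table are left untouched.
    return re.sub(r"\$([A-Z]+)",
                  lambda m: values.get(m.group(1), m.group(0)),
                  template)


template = "/var/log/HOSTS/$HOST/$FACILITY/$YEAR/$MONTH/$DAY/$FACILITY$YEAR$MONTH$DAY"
print(expand_macros(template, datetime(2008, 4, 7), "node1", "kern"))
```

Because the per-day directories accumulate, a listing produced this way also helps to decide which older files to remove or archive.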
50. In the navigation bar, enter the URL http://<BMC IP addr>

4.2 Updating the BMC Firmware on NovaScale R421, R422, R422 E1 and R423 machines

These platforms use the BMC SIMSO or SIMSO+ add-on boards for platform management. Both boards provide IPMI 2.0 functions. The SIMSO+ board provides additional KVM over LAN functionality.

The BMC firmware and the tool needed to carry out the upgrade are included in the following RPM:

    update-bmc-fw-<BMC firmware version>.Bull.x86_64.rpm

The BMC firmware of these boards can be updated under Linux using the updatefw x86_64 command.

To update the BMC firmware on the local machine, do the following:

1. Install the update-bmc-fw-<fw version> RPM onto the machine.

2. Start the IPMI service, if it has not already been started:

       service ipmi start

3. Run the command below:

       updatefw x86_64 -f /usr/local/firmware/<firmware>.bin

   Where <firmware> is:
       ubsim<BMC FW version> for a SIMSO board
       ugsim<BMC FW version> for a SIMSO with KVM board

4. Initialize the Sensor Data Repository (SDR) on the local machine:

       sdrload /usr/local/firmware/<platform>.sdr.dat

   Where <platform> is either r421, r422 (for NovaScale R422 and R422 E1 machines) or R423.

To update the BMC firmware on a remote machine, do the following:

1. Install t
51. M considers to be in a DOWN state and check to see if the slurmd daemon is running, using the following command:
    ps -ef | grep slurmd
4. If slurmd is not running, restart it as the root user using the following command:
    service slurm start
5. Check the SlurmdLogFile file named in the slurm.conf file for an indication of why it failed.
   a. If slurmd is running but not responding (a very rare situation), then kill and restart it as the root user using the following commands:
    service slurm stop
    service slurm start
6. If the node is still not responding, there may be a network or configuration problem; see section 3.7.5 Networking and Configuration Problems.
7. If the node is still not responding, increase the verbosity of debug messages by increasing SlurmdDebug in the slurm.conf file, and restart. Again, check the log file for an indication of why it failed.
8. If the node is still not responding, without an indication as to the failure mode, stop the service, add the daemon option -c to the /etc/slurm/slurm.sh script as shown below, and restart:
    service slurm stop
    SLURM_OPTIONS_DAEMONS="-c"
    service slurm start
Note: All running jobs and other state information will be lost when using this option.
3.7.5 Networking and Configuration Problems
1. Use the following command to examine the status of the nodes and partitions:
    sinfo --all
2. Use the following commands to confirm that the control daem
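A plain `grep slurmd` over the process list also matches the grep process itself, which can make the check for a running daemon misleading. The following Python sketch shows one way to test captured `ps -ef` output for a real slurmd process. The function name `slurmd_running` and the sample output are assumptions for illustration, not part of SLURM.

```python
def slurmd_running(ps_output):
    """Return True if a slurmd process appears in `ps -ef` style output."""
    for line in ps_output.splitlines():
        # Columns: UID PID PPID C STIME TTY TIME CMD (CMD may contain spaces)
        fields = line.split(None, 7)
        if len(fields) < 8:
            continue
        command = fields[7].split()
        if not command:
            continue
        # Compare the executable name only, ignoring its directory and arguments.
        executable = command[0].rsplit("/", 1)[-1]
        if executable == "slurmd":
            return True
    return False

# Made-up sample output, including the grep process that must be ignored.
sample = """UID        PID  PPID  C STIME TTY          TIME CMD
root      4211     1  0 09:12 ?        00:00:03 /usr/sbin/slurmd
root      5120  5001  0 10:40 pts/0    00:00:00 grep slurmd
"""
print(slurmd_running(sample))
```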
52. NOT facility auth authpriv mail news
log Section
In this section you define how the messages will be processed, using the source, destination and filter commands defined in the previous sections.
Syntax:
    log { source(s1); source(s2); ... filter(f1); filter(f2); ... destination(d1); destination(d2); ... flags(flag1[, flag2...]); };
Examples:
    log { source(src); filter(f_news); filter(f_notice); destination(newsnotice); };
    log { source(src); destination(full); };
2-12 BAS5 for Xeon Maintenance Guide
2.2.6 Upgrading Emulex HBA Firmware with lptools
lptools is a set of two utilities for upgrading Emulex HBA firmware. These two utilities are:
- lputil: a low level tool used to interact with an Emulex HBA
- lpflash: a high level script used to upgrade the firmware of a set of Emulex HBAs
The Emulex driver (lpfc module) has to be loaded when using lptools (check with lsmod). Firmware updates are available from the Emulex Web site.
On a node, you can get the current firmware level of all the Emulex HBAs using the lsiocfg tool (getting information about storage devices).
Warning: Be sure that FC devices are not being used when upgrading the Emulex HBA firmware.
2.2.6.1 lputil
This low level tool should not be used in standalone mode. Please refer to the on-line help when using this tool.
2.2.6.2 lpflash
lpflash flashes Emulex HBAs with the specified firmware file. lpflash may be used to upgrade in one sh
53. Note that the default vsftpd PAM configuration also checks /etc/vsftpd/ftpusers for users that are denied (root, bin).
3. Start the vsftpd server as follows:
    root@host# service vsftpd start
    Starting vsftpd for vsftpd: [ OK ]
4. Check that FTP is working correctly:
    root@host# ftp host
    Connected to host.
    220 (vsFTPd 2.0.1)
    530 Please login with USER and PASS.
    530 Please login with USER and PASS.
    KERBEROS_V4 rejected as an authentication type
    Name (host:root): root
    331 Please specify the password.
    Password:
    230 Login successful.
    Remote system type is UNIX.
    Using binary mode to transfer files.
    ftp> quit
    221 Goodbye.
5.2.2 Configuring the FTP server options for the InfiniBand switch
Enter the FTP configuration menu as follows:
    ssh enable@switchname
    enable@switchname's password: voltaire
    Welcome to Voltaire Switch switchname
    connecting
    switchname# config
    switchname(config)# ftp
    switchname(config-ftp)#
The following settings define the node 172.20.0.102 as the FTP server. The switch logs onto this server using Joe's account, with the password yummy.
    switchname(config-ftp)# server 172.20.0.102
    switchname(config-ftp)# username joe
    switchname(config-ftp)# password yummy
Once FTP is set up on the switch, make sure the FTP server is running on the Management Node:
    ftp host
If ftp fails to connect t
54. RAID Disabled
Configure SAS as SW RAID: Disabled
Serial Ports Configuration
    Serial A Enable: Enabled
    Address: 3F8
    IRQ: 4
    Serial B Enable: Enabled
    Address: 2F8
    IRQ: 3
USB Configuration
    USB Controller: Enabled
    Legacy USB Support: Enabled
    Port 60/64 emulation: Disabled
    Device reset Timeout: 20s
    Storage Emulation: Auto
    USB 2.0 Controller: Enabled
PCI Configuration
    Memory mapped I/O start addr: 2.00GB
    Memory mapped I/O above 4GB: Disabled
    Onboard video: Enabled
    Dual Monitor Video: Disabled
    Onboard NIC1 ROM: Enabled
    Onboard NIC2 ROM: Disabled
    I/O Module NIC ROM: Disabled
    Intel IOAT: Enabled
System Acoustic & Perf
    Throttling mode: Closed Loop
Security
    Administrator password: Not Installed
    User Password: Not Installed
    Front panel lockout: Disabled
Server Management
    Assert NMI on SERR: Enabled
    Assert NMI on PERR: Enabled
    Resume on AC Power Loss: Last state
    Windows hw error architecture: Enabled
    FRB-2 Enable: Enabled
    OS boot Watchdog: Disabled
    BMC Plug & Play detection: Disabled
Console Redirection
    Console Redirection: Serial B
    Flow Control: None
    Baud Rate: 115.2k
    Terminal Type: VT100
    Legacy OS Redirection: Disabled
Boot Options
    Boot Timeout: 0
    Boot Option 1 Boot Option 2 Boot Option 3 Boot Option 4 Hard Disk Order hard disk 1 hard disk 2 hard disk 3 network device 1 network device 2 Network Device Order
55. T_clientelan e0000047fcfff680 b02a458d 544e 974f 8c92 23313049885 e 4 6 UP osc OSC_nova9_ost_nova6 ddn0 19 MNT_clientelan e0000047fcfff680 b02a458d 544e 974f 8c92 23313049885e 4 7 UP osc OSC_nova9_ost_noval0 ddn0 7_MNT_clientelan e0000047fcfff680 b02a458d 544e 974f 8c92 23313049885 e 4 8 UP osc OSC_nova9_ost_nova6 ddn0 1_MNT_clientelan e0000047fcfff680 b02a458d 544e 974f 8c92 23313049885e 4 9 UP osc OSC_nova9_ost_noval0 ddn0 23_MNT_clientelan e0000047fcfff680 b02a458d 544e 974f 8c92 23313049885 e 4 10 UP osc OSC_nova9_ost_nova6 ddn0 17_MNT_clientelan e0000047fcfff680 b02a458d 544e 974f 8c92 23313049885e 4 11 UP osc OSC_nova9_ost_noval0 ddn0 13_MNT_clientelan e0000047fcfff680 b02a458d 544e 974f 8c92 23313049885 e 4 12 UP osc OSC_nova9_ost_nova6 ddn0 9 MNT_clientelan e0000047fcfff680 b02a458d 544e 974f 8c92 23313049885e 4 13 UP osc OSC_nova9_ost_noval0 ddn0 15_MNT_clientelan e0000047fcfff680 b02a458d 544e 974f 8c92 23313049885 e 4 14 UP mdc MDC_nova9_mdt_nova5 ddn0 25_MNT_clientelan e0000047fcfff680 b02a458d 544e 974f 8c92 23313049885e 4 The last line indicates the state of the MDC which is the client connecting to the MDT on the MDS The other lines indicate the state of the OSC which are the clients connecting to each OST on the nova and novaio OSS var log lustre HA_yy mm dd log This file provides a trace of the calls made by CS5 to the Lustre failover scripts ES Note In the HA_yy mm dd log
56. Verbose Mode (-v Option)
Some of the storage commands have a -v (verbose) option, which provides more output information during the processing of the command.
See the Bull HPC BAS5 for Xeon Administrator's Guide for an inventory of the storage commands supporting the -v option.
3.4.1.2 Log/Trace System
Principle
If the verbose mode is not enough, a system of traces can also be configured to obtain more information on some commands. To activate these traces, set the trace level in the appropriate /etc/storageadmin/*.conf file. There are two lines in these files to set the trace. These lines look as follows, where <command_name> is the name of the command to debug:
    <command_name>_TRACE_STDOUT_LEVEL
    <command_name>_TRACE_LOG_FILE_LEVEL
The first line is used to activate traces on stdout; the second one is used to generate traces in a /tmp/storregister.<PID>.traces.log file. By default, the two lines are commented out.
Note: It is recommended to use this trace tool only for temporary debugging, because there is no automatic cleaning of the /tmp/<command_name>.<PID>.traces.log files.
Four levels of traces are available:
- 4 -> TRACE_LEVEL_DEBUG
- 3 -> TRACE_LEVEL_INFO
- 2 -> TRACE_LEVEL_WARNING
- 1 -> TRACE_LEVEL_ERROR
Level 4 is the most verbose level; level 1 traces only error messages.
Note: It is not possible to add new commands. All the commands
57. a physical state change for the port specified. This is useful when the active width/speed of a specific port must be changed without the cable being reconnected.
Syntax:
    portmanage.sh [-v] [-f] <-d|-e|-r> <LID> <PORT>
Options:
    -v             Increase output verbosity level.
    -f             Force disabling or resetting a port, even when the port is located on the Access Path (path way to the specific port).
    -d lid port    Disable the port.
    -e lid port    Enable the port (set port state machine to polling state).
    -r lid port    Reset the port.
    -S lid port    Reset the port and set Enabled Speed to SDR.
    -D lid port    Reset the port and set Enabled Speed to SDR/DDR.
    -h             Show this help.
Example:
    portmanage.sh -r 17 21      (reset LID 17, PORT 21)
2.4.4 Getting Information about Storage Devices (lsiocfg)
lsiocfg is a tool used for reporting information about storage devices. It is mainly dedicated to external storage systems (DDN and FDA disk arrays) and their dedicated Host Bus Adapters (Emulex FC adapters), but it can also be used with internal system storage (system disks and their Host Bus Adapters).
Reported information is related to several inventories:
- Host Bus Adapters (-c flag)
- Disks (-d flag)
- Disk partitions (-p flag)
- Disk usages
Syntax
According to needed information, lsiocfg can be used with options related to each inventor
58. a remote work station connected to a development machine configured as a DHCP server.
2.6 Testing Maintenance Tools
2.6.1 Checking Nodes after Boot Phase (postbootchecker)
postbootchecker detects when a Compute Node is starting and runs check operations on this node after its boot phase. The objective is to verify that CPU and memory parameters are coherent with the values stored in the ClusterDB and, if necessary, to update the ClusterDB with the real values.
2.6.1.1 Prerequisites
- syslog-ng must be installed and configured as follows:
    Management Node: management of the logs coming from the cluster nodes.
    Compute nodes: detection of the compute nodes as they start.
- The postbootchecker service must be installed before the RMS service, to avoid any disturbance for the jobs.
2.6.1.2 postbootchecker Checks for the Compute Nodes
The postbootchecker service (/etc/init.d/postbootchecker) detects every time a Compute Node starts. Whilst the node is starting up, postbootchecker runs three scripts to retrieve information about processors and memory. These scripts are the following:
    Script name     Description
    procTest.pl     Retrieves the number of CPUs available for the node.
    memTest.pl      Retrieves the size of memory available for the node.
    modelTest.pl    Retrieves model information for the CPUs available on the node.
Then postbootchecker returns this in
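The comparison that postbootchecker performs between measured values and the values stored in the ClusterDB can be pictured with a short Python sketch. This is only an illustration of the idea, not the tool's implementation: the function name `diff_node_values` and the field names `cpu_count`, `memory_mb` and `cpu_model` are hypothetical and do not reflect the real ClusterDB schema.

```python
def diff_node_values(stored, measured):
    """Return the fields whose measured value differs from the stored one."""
    updates = {}
    for field, real_value in measured.items():
        if stored.get(field) != real_value:
            updates[field] = real_value
    return updates

# Hypothetical values: what the database holds versus what the node reports.
stored = {"cpu_count": 8, "memory_mb": 16384, "cpu_model": "Xeon E5462"}
measured = {"cpu_count": 8, "memory_mb": 12288, "cpu_model": "Xeon E5462"}

# Only the differing fields would need to be written back to the database.
print(diff_node_values(stored, measured))
```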
59. accepting this system of traces are listed in the corresponding conf file.
See the Bull HPC BAS5 for Xeon Administrator's Guide to identify the right configuration file.
3.4.1.3 Example
The following example explains how to obtain log file and/or stdout traces for the storregister command.
1. Find the right /etc/storageadmin/*.conf file to modify. In the case of the storregister command, it is storframework.conf, because of the presence of these two lines:
    storregister_TRACE_STDOUT_LEVEL
    storregister_TRACE_LOG_FILE_LEVEL
2. Edit the storframework.conf file. Uncomment one of the two previous lines. Choose a level of trace between 1 (lowest) and 4 (highest level). For example, to add traces of debug level (4, highest level) on stdout only, the storframework.conf file must contain the following lines:
    # STDOUT trace level configuration
    storregister_TRACE_STDOUT_LEVEL
    # log file trace level configuration
    storregister_TRACE_LOG_FILE_LEVEL
3. Save the storframework.conf file.
4. Relaunch storregister. New traces will appear on the stdout.
Available Troubleshooting Options for Storage Commands
The following table sums up the available troubleshooting options for the storage commands.
    Command   User   -v option   Log Trac
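To check which traces a configuration file would activate, the trace lines can be read programmatically. The Python sketch below assumes each active line has the form `NAME = LEVEL` and that commented lines start with `#`. That syntax is an assumption, because the exact layout of the real configuration files is not shown here. The level names follow the four levels described in this guide, and the function name `active_traces` is hypothetical.

```python
import re

# Level numbers and names as listed in this guide.
LEVEL_NAMES = {1: "TRACE_LEVEL_ERROR", 2: "TRACE_LEVEL_WARNING",
               3: "TRACE_LEVEL_INFO", 4: "TRACE_LEVEL_DEBUG"}

# Assumed line syntax: <command>_TRACE_STDOUT_LEVEL = <n> (or LOG_FILE).
LINE_RE = re.compile(r"^\s*(\w+)_TRACE_(STDOUT|LOG_FILE)_LEVEL\s*=\s*(\d+)\s*$")

def active_traces(conf_text):
    """Map (command, target) to a level name for every uncommented trace line."""
    result = {}
    for line in conf_text.splitlines():
        if line.lstrip().startswith("#"):
            continue                      # commented out: trace disabled
        m = LINE_RE.match(line)
        if not m:
            continue
        command, target, level = m.group(1), m.group(2), int(m.group(3))
        if level in LEVEL_NAMES:
            result[(command, target)] = LEVEL_NAMES[level]
    return result

sample_conf = """# STDOUT trace level configuration
storregister_TRACE_STDOUT_LEVEL = 4
# log file trace level configuration
#storregister_TRACE_LOG_FILE_LEVEL = 4
"""
print(active_traces(sample_conf))
```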
60. achines. It also defines the recommended settings for the BIOS parameters for these machines.
7.1 Updating the BIOS on NovaScale R421, R422, R422 E1 and R423
This section describes how to update the motherboard BIOS of a NovaScale R421, R422, R422 E1 or R423 machine.
Install the bios-<platform>-<bios version> RPM corresponding to your platform and to the new BIOS release. The corresponding BIOS DOS image, <BIOS>.IMG, is installed in /usr/local/firmware/.
WARNING
- Ensure that the BIOS version corresponding to your platform is used.
- The BIOS upgrade MUST NOT be interrupted while it is in progress.
- If the BIOS does not work, a new BIOS chip must be ordered.
To install a new BIOS locally:
1. Copy the <BIOS>.IMG file onto a USB key:
    dd if=/usr/local/firmware/<BIOS>.IMG of=/dev/sd<your USB device>
2. Insert the key and reboot the machine. The autoexec file contained in the DOS image automatically starts the BIOS update. Wait for the BIOS installation to finish.
3. Remove the USB key.
4. Restart the machine.
To install a new BIOS on a remote machine using PXE:
Note: The remote machine must be configured to boot via PXE on the server. The server must be configured as a TFTP server.
1. Install the update-bios RPM on the server.
2. If the remote machine is accessible using IPMI, run this command on the server: up
61. age Troubleshooting
- 3.5 Lustre Troubleshooting
- 3.6 Lustre File System High Availability Troubleshooting
- 3.7 SLURM Troubleshooting
- 3.8 FLEXlm License Manager Troubleshooting
3.1 Troubleshooting Voltaire Networks
3.1.1 Voltaire's Fabric Manager
Voltaire's Fabric Manager enables InfiniBand fabric connectivity debugging using the built-in Performance Manager (PM). PM has two major capabilities:
- Port Counters Monitoring and Report: The PM generates a periodic port counters report file, in CSV format, that can be loaded into Excel and further analyzed by the user. It also monitors port counter errors and reports every port that passes its error threshold limit, as configured by the user.
- Event Logging: This creates an event log file for both IB traps and SubNet internal events. The user may filter the events using a GUI and/or a CLI. The filtering policy determines whether an event is logged and whether a trap is generated.
It is essential to identify any problem ports and node connectivity problems prior to running applications, as well as during standard operation.
Note: See the Voltaire Switch User Manual ISR 9024, ISR 9096 and ISR 9288/2012 Switches for details on how to configure and use Port Counters and the Performance Manager. This manual also includes a description of all the PortCounter fields and counter values.
3.1.2 Fabric Diagnostics
Diagnostic is recommended in the following cases:
- During
62. aintaining and troubleshooting the Bull HPC clusters of NovaScale R4xx nodes, based on Intel® Xeon® processors.
Prerequisites
Readers need a basic understanding of the hardware and software components that make up a Bull HPC cluster, and are advised to read the documentation listed in the Bibliography below.
Structure
This guide is organized as follows:
    Chapter 1   Stopping/Restarting Procedures
                Describes procedures for stopping and restarting Bull HPC cluster components.
    Chapter 2   Day to Day Maintenance Operations
                Describes how to undertake different types of maintenance operations using the set of maintenance tools provided with Bull HPC clusters.
    Chapter 3   Troubleshooting
                This chapter aims to help the user develop a general, comprehensive methodology for identifying and solving problems on- and off-site.
    Chapter 4   Updating the BMC Firmware on NovaScale R421, R422
                Describes how to update the BMC firmware on NovaScale R421 and R422 systems.
    Chapter 5   Updating the firmware for the InfiniBand switches
                Describes how to update the Voltaire switch firmware.
    Chapter 6   Updating the firmware for the MegaRAID Card
                Describes how to update the firmware for the MegaRAID card.
    Chapter 7   Managing the BIOS on NovaScale R4xxx Machines
                Describes how to update the BIOS on NovaScale R421 and R422 machines. It also defines the recommended settings for the BIOS parameters on NovaScale R4xxx machines.
    Glossary
63. atus of a file system is CRITICAL according to the lustre_util status command, and if the file system needs to be re-installed (for instance, if some nodes of the cluster have been deployed and reconfigured), it is possible that the file system description needs to be removed from the cluster management database, as shown below:
1. Run the following command to install the fs1 file system:
    lustre_util install -f /etc/lustre/models/fs1.lmf
   The command may issue an output similar to:
    file system already installed, do remove first
2. Run the following command to remove the fs1 file system:
    lustre_util remove -f fs1
   The command may fail with a message similar to:
    file system not loaded, try to give the full path
If it is possible neither to re-install nor to remove the file system (even with the force option -F), the lustre_fs_dba command can then be used to remove the file system information from the cluster management database. For example, to remove the fs1 file system description from the cluster management database, enter the following command:
    lustre_fs_dba del -f fs1
After this command, the file system can be re-installed using the lustre_util install command.
3.6 Lustre File System High Availability Troubleshooting
Before using a Lustre file system configured with the High Availability (HA) feature, or in the event of abnormal operation of HA services, it
64. cape character can be changed to & to prevent conflicts with ssh.
- Use the ESC and the number 2 keys instead of using the F2 key to access the BIOS on NovaScale R440 and R460 machines.
- Use the ESC and the minus keys instead of using the DEL key to access the BIOS on NovaScale R421 and R422 machines.
4.1.2.3 Web remote access
The BMC can be accessed using a web interface for NovaScale R421, R422, R422 E1 and R423 machines.
See the Bull NovaScale R42x AOC-SIMSO/SIMSO+ Installation and User's Guide for more information.
The Web interface provides access to the SOL console or the KVM console (SIMSO+), and also the means to access virtual devices for maintenance purposes.
To access the BMC of a remote machine through the Web interface:
1. The following RPMs, found in the BONUS directory on the Bull XHPC DVD, must be installed on the Management Node:
    XHPC/BONUS/jre-<version>-linux-i586.rpm
    XHPC/BONUS/firefox-<version>-Bull.0.i386.rpm
   These are installed by running the commands below:
    cd /release/XBAS5V1.1/XHPC/BONUS
    rpm -i jre-<version>-linux-i586.rpm firefox-<version>-Bull.0.i386.rpm
2. The Java plug-in should be configured for Firefox:
    ln -s /usr/java/jre1.<version>/plugin/i386/ns7/libjavaplugin_oji.so /usr/local/firefox/plugin
3. The remote BMC is accessed using the command below:
    /usr/local/firefox/firefox
4
65. 2-13
2.3 Saving and Restoring the System (mkCDrec) 2-15
2.3.1 Configuring mkCDrec 2-16
2.3.2 Creating a Backup 2-16
2.3.3 Restoring a System 2-17
2.4 Monitoring Maintenance Tools 2-18
2.4.1 Checking the status of InfiniBand Networks (ibstatus, ibstat) 2-18
2.4.2 Diagnosing InfiniBand Fabric Problems (IBS tool) 2-20
2.4.3 Monitoring Voltaire Switches (switchname) 2-30
2.4.4 Getting Information about Storage Devices (lsiocfg) 2-33
2.4.5 Checking Device Power State (pingcheck) 2-36
2.5 Debugging Maintenance Tools 2-37
2.5.1 Modifying the Core Dump 2-37
2.5.2 Identifying InfiniBand Network Problems (ibdoctor, ibtracert) 2-37
2.5.3 Using dump tools with RHEL5 (crash, proc, kdump) 2-40
2.5.4 Identifying problems in the different parts of a kernel 2-41
2.6 Testing Maintenance Tools 2-42
2.6.1 Checking Nodes after Boot Phase (postbootchecker) 2-42
Chapter 3 Troubleshooting 3-1
3.1 Troub
66. counter XmtDiscards = 65535 (threshold 100)
    Error check on lid 2 port all: FAILED
    #warn: counter RcvSwRelayErrors = 3995 (threshold 100)
    Error check on lid 2 port 4: FAILED
    # Checked Switch: nodeguid 0x0008f104004118d8 with failure
    # Checking Ca: nodeguid 0x0008f10403979970
    # Checking Ca: nodeguid 0x0008f10403979860
    # Checking Ca: nodeguid 0x0008f104039798ec
    # Checking Ca: nodeguid 0x0008f1040397996c
    # Checking Ca: nodeguid 0x0008f104039798e8
    # Checking Ca: nodeguid 0x0008f10403979910
    # Checking Ca: nodeguid 0x0008f104039798e4
    # Checking Ca: nodeguid 0x0008f10403979920
    # Checking Ca: nodeguid 0x0008f10403979948
    # Checking Ca: nodeguid 0x0008f104039798f4
    # Checking Ca: nodeguid 0x0008f104039798d0
    # Checking Ca: nodeguid 0x0008f10403977ca4
    ## Summary: 13 nodes checked, 0 bad nodes found
    ##          24 ports checked, 0 bad ports found
    ##          1 ports have errors beyond threshold
3.2.4 ibcheckwidth and ibcheckportwidth
ibcheckwidth checks all nodes, using the complete topology file which was created by ibnetdiscover, to validate the bandwidth for links which are active, and will also identify ports with 1X bandwidth.
ibcheckwidth Output Example:
    ## Summary: 40 nodes checked, 0 bad nodes found
    ##          140 ports checked, 0 ports with 1x width in error found
ibcheckportwidth checks connectivity and the link width for a given port lid, and will indicate the actual bandwidth being used by the port. Thi
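When an error report like the one above is long, it can help to pull out only the counters that exceed their threshold. The Python sketch below does this with a regular expression. The exact punctuation of the warning lines is an assumption, so the pattern accepts them with or without the `=` sign and parentheses, and the function name `counters_over_threshold` is hypothetical.

```python
import re

# Tolerant pattern: "counter NAME = VALUE (threshold N)" or "counter NAME VALUE threshold N".
WARN_RE = re.compile(r"counter\s+(\w+)\s*=?\s*(\d+)\s*\(?threshold\s+(\d+)\)?")

def counters_over_threshold(report):
    """Return (counter, value, threshold) for every counter above its threshold."""
    found = []
    for m in WARN_RE.finditer(report):
        name, value, threshold = m.group(1), int(m.group(2)), int(m.group(3))
        if value > threshold:
            found.append((name, value, threshold))
    return found

# Sample text modelled on the output example in this guide.
report = """#warn: counter XmtDiscards = 65535 (threshold 100)
Error check on lid 2 port all: FAILED
#warn: counter RcvSwRelayErrors = 3995 (threshold 100)
Error check on lid 2 port 4: FAILED
"""
for name, value, threshold in counters_over_threshold(report):
    print(name, value, threshold)
```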
67. d switch 5-3
5.3 Upgrading the firmware 5-4
Chapter 6 Updating the firmware for the MegaRAID card 6-1
Chapter 7 Managing the BIOS on NovaScale R4xxx Machines 7-1
7.1 Updating the BIOS on NovaScale R421, R422, R422 E1 and R423 7-1
7.2 Updating the BIOS on NovaScale R440 or R460 7-3
7.3 BIOS Parameter Settings for NovaScale Rxxx Nodes 7-4
7.3.1 Examples 7-4
7.3.2 NovaScale R421 BIOS Settings 7-5
7.3.3 NovaScale R421 E1 BIOS Settings 7-8
7.3.4 NovaScale R422 BIOS Settings 7-10
7.3.5 NovaScale R422 E1 BIOS Settings 7-13
7.3.6 NovaScale R423 BIOS Settings 7-16
7.3.7 NovaScale R440 SATA BIOS Settings 7-19
7.3.8 NovaScale R440 SAS BIOS Settings 7-21
7.3.9 NovaScale R460 BIOS Settings 7-23
Glossary and Acronyms G-1
Index I-1
List of Figures
Figure 2-1. Example of IBS command topo action output 2-22
Figure 2-2. Example of IBS command bandwidth action output 2-24
Figure 2-3. Example of IBS command errors action output 2-25
Figure 3-1. Op
68. date-bios <remote IP address> /usr/local/firmware/<BIOS>.IMG <BMC IP address>
   or, if the server can connect to the remote machine using ssh, then run this command:
    update-bios <remote IP address> /usr/local/firmware/<BIOS>.IMG
3. The update-bios command returns after the BIOS update is completed on the remote machine.
Usage:
    update-bios <ipaddr> <bios image> <bmc ipaddr> <user name> <user passwd>
    ipaddr        network address of the remote machine to have its BIOS updated
    bios image    local path to the BIOS DOS image file
    bmc ipaddr    BMC address of the remote machine
    user name     BMC user name
    user passwd   BMC user password
To install a new BIOS on a remote machine using the Web interface (R421, R422, R422 E1 and R423):
On the R421, R422, R422 E1 and R423 platforms, it is possible to access the BMC through the Web interface (see Chapter 4). From the administration node:
1. Start the Firefox navigator:
    /usr/local/firefox/firefox
2. In the navigation bar, type the URL of the remote BMC, http://<BMC IP addr>, and log in to the BMC.
3. Select the Virtual Media button and upload the BIOS image /usr/local/firmware/<BIOS>.IMG corresponding to the machine.
4. Select the Console button to access the console of the remote system.
5. Restart the remote system. The BIOS DOS image will boot and flash the new BIOS. The progression can be followed in the co
69. dee eed ee 13 Oper VES cil a a oes VLO 7 ParChn Once ln miis Ga WO weed ba vas 0 PartEnforceOutb cece eee eee 0 Fa PEGE RAWLOD amp or peepi coer a lea ois 0 FilterRawOutb 206 0 KEVVIOLAELONSS rd do ales ee 0 PkeyViolations 6 0 OKGVVLOLACI ONS S ao e 0 GULAC Ses cet fade bb Ae esl aie aos 32 ClientReregister 6 0 Subnet Timeout snc 854 ese e 18 RESPTAMOV AL Etre aceite ipia iiie 1 THO Cad PHYS BPE S ws csncee a lalate ateos 15 OVER RUNES se ciea cede o es 0 MaxCreditHiNti ooooooooooooo 0 ROUDO TL Pai AA ee 0 switchinfo example An example of use of this command including the Local ID is below smpquery switchinfo 0x4 The resulting information output will be similar to that displayed below LINnGAath OD Cap seme apio a eer es 49152 RAN COME AD CAD rinitis es 0 CASTE CD CAD ms leed 1024 LANGALF AD TOD fe oie ee cide oe eee eee eee 46 DETRLOPE SS inci is Wake Gres 0 DefMcastPrimPotti o oooooooooo 0 DefMcastNotPrimPort 0 LASTIMA a a E E 15 StatechangS sree aa aras e 0 LEASP SER ORES ie e is de 0 PartEntorcoslapPiis nia latas 32 EINDOUNGPACCHME i ici a 1 OutboundPartEntE tii eae is e cado 1 FilterRawInbound 1 FilterRawlnbound cassas ede oo eine 1 ENnAANCCAP OM COS inci lo e aae wis 0 Troubleshooting 3 7 3 2 2 3 8 perfquery perfquery uses Performance Management General Service
70. e BIOS on NovaScale R4xxx Machines 7 21 BIOS setup section parameter value ACPI Redirection Port Disabled Baud Rate Flow Control Terminal Type VT100 Remote Console Reset Assert NMI on PERR Enabled Assert NMI on SERR Enabled FRB 2 Policy Retry 3 Times Boot Monitoring Disabled Boot Monitoring Policy Retry 3 Times Thermal Sensor Enabled BMC IRQ RQ 11 Post Error Pause Enabled AC LINK Last State Power On Delay Time 20 Platform Event Filtering Enabled Boot 7 22 AN OAK ON BASS for Xeon Maintenance Guide 7 3 9 NovaScale R460 BIOS Settings System part number N8100 1247E R460 BIOS 5S46 Motherboard Jumper settings JSASRAID2 1 2 RAID disable BIOS setup section parameter value System Date Hard Disk Pre Delay Disabled Processor Settings Processor Retest No Execute Disable Bit Disabled Intel R Virtualization Tech Disabled Enhanced Intel SpeedStep R Tech Disabled Language Advanced Memory Configuration Memory Retest No Extended RAM Step Disabled Memory RAS Feature Interleave Sparing Disabled PCI Configuration Onboard Video Controller yGA Controller Enabled Onboard VGA Option ROM Scan Auto Onboard LAN LAN Controller Enabled LAN1 Option ROM Scan Enabled LAN2 Option ROM Scan Enabled PCI Slot 1B Option ROM Enabled PCI Slot 1C Option ROM Enabled PCI Slot 2B Option ROM Enabled PCI Slot 2C Option ROM Enabled PCI Slo
71. e switch 1-3
    Ethernet switch 1-3
    HPC cluster 1-4
    node 1-2
start-restore.sh 2-17
startrestore.sh script 2-15
stopping
    Backbone switch 1-3
    Ethernet switch 1-3
    HPC cluster 1-4
    node 1-1
storage
    troubleshooting 3-14
storage device
    getting information 2-33
storageadmin conf file 3-14
storioha 3-23
storioha command 3-21
stormap command 3-22
switchname command 2-30
syslog 3-23
syslog-ng 2-8
syslog-ng.conf file 2-9
system logs
    managing 2-8
T
trace levels
    storage 3-14
trace log
    storage 3-14
troubleshooting
    FDA storage system 3-16
    FLEXlm License Manager 3-28
    InfiniBand 3-5
    Lustre HA 3-19
    SLURM 3-24
    storage 3-14
    Voltaire 3-1
U
ulimit command 2-37
updatefw x86_64 command 4-4
V
Voltaire switch firmware 5-1
Voltaire Switches 2-30
Technical publication remarks form
Title: BAS5 for Xeon Maintenance Guide
Reference: 86 A2 90EW 00
Date: April 2008
ERRORS IN PUBLICATION
SUGGESTIONS FOR IMPROVEMENT TO PUBLICATION
Your comments will be promptly investigated by qualified technical personnel and action will be taken as required. If you require a written reply, please include your complete mailing address below.
NAME / DATE / COMPANY / ADDRESS
Please give this technical publication remarks form to your BULL representative or mail to: Bull, Documentation Dept., 1 Rue de Provence, BP 208, 38432 ECHIROLLES
72. e the following command:
    pdsh -w <node list> date
A matter of a few seconds is inconsequential, but SLURM is unable to recognize the credentials of nodes that are more than 5 minutes out of synchronization. See Chapter 2 in the Bull HPC BAS5 for Xeon Installation and Configuration Guide for information on setting node times using the NTP protocol.
3.7.6 More Information
For more information on SLURM troubleshooting, see the Bull HPC BAS5 for Xeon Administrator's Guide, the Bull HPC BAS5 for Xeon User's Guide and http://www.llnl.gov/linux/slurm/slurm.html
3.8 FLEXlm License Manager Troubleshooting
3.8.1 Entering License File Data
You can edit the hostname on the SERVER line (first argument), the port address (third argument), the path to the vendor daemon on the VENDOR line (if present), or any right half of a string (b) of the form a=b, where a is all lower case. Any other changes will invalidate the license.
Be cautious when transferring data received by mailers. Many mailers add characters at the end of a line that may confuse the reader about the real license data.
3.8.2 Using the lmdiag utility
The lmdiag command analyzes a license file with respect to the SERVER, the FEATUREs, license counts and dates. It may help you to understand problems that may occur. lmdiag attempts to check out all FEATUREs and explains failures. You may run extended diagnostics attempting to connect to the license manager
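For the node clock check in the SLURM section above, the five-minute tolerance can be applied mechanically once the node times have been collected. The Python sketch below assumes the times have already been converted to seconds since the epoch, for example from the `date` output gathered on each node. The function name `nodes_out_of_sync` and the sample values are illustrative assumptions.

```python
MAX_SKEW_SECONDS = 5 * 60   # tolerance stated in this guide: 5 minutes

def nodes_out_of_sync(node_times, reference_time, max_skew=MAX_SKEW_SECONDS):
    """Return the sorted names of nodes whose clock differs from the reference
    by more than max_skew seconds."""
    return sorted(node for node, t in node_times.items()
                  if abs(t - reference_time) > max_skew)

# Hypothetical epoch timestamps collected from three nodes.
times = {"node1": 1_200_000_000, "node2": 1_200_000_004, "node3": 1_200_000_400}
print(nodes_out_of_sync(times, reference_time=1_200_000_000))
```

A skew of a few seconds, as for node2 here, is ignored. Only node3, which is 400 seconds off, is reported.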
73. ed Mode Normal Base I O address Serial port B 2F8 Interrupt Serial port B IRQ 3 DMI Event Logging Event Logging ECC Event Logging Console Redirection Com Port Address Baud Rate Console Type Flow Control Console connection Direct Continue C R after POST E Hardware Monitor Fan Speed Control Modes 2 3 pin Server IPMI System Event Logging Enabled Clear System Event Log Disabled SYS Firmware Progress Disabled BIOS POST Errors Enabled BIOS POST Watchdog Disabled OS boot Watchdog Disabled Timer for loading OS min 10 Time out action No Action Security Supervisor Password Is Clear User Password Is Clear Password on boot Disabled 7 14 BASS for Xeon Maintenance Guide BIOS setup section parameter Boot AN 00h WN Managing the BIOS on NovaScale R4xxx Machines 7 15 7 3 6 NovaScale R423 BIOS Settings mainboard X7DWN R423 BIOS 1 06 7DWNC217 BIOS setup section parameter value Main System Time System Date Legacy diskette A 1 44MB Parallel ATA Enabled Serial ATA Enabled SATA Controller Mode Option Enhanced SATA Raid enable Disabled SATA AHCI enable Disabled Advanced Boot Features QuickBoot Mode QuietBoot Mode POST Errors ACPI Mode Power Button Behaviour Resume On Modem Ring EFI os boot Power Loss Control Watch Dog Summary screen Yes InstantOff Off Disabled Last State Disabled
74. ed using the OFED tools:
    ibs -s iswu0c0-0 -a dbupdate -NE
In order to ensure that the data is always up to date, add the following line to the cron table, using crontab -e:
    */10 * * * * PATH=/usr/local/ofed/bin:$PATH /usr/bin/ibs -s iswu0c0-0 -a dbupdate -vNE >> /var/log/ibs.log 2>&1
The traffic and error counters, as well as the InfiniBand equipment stored in the IBS database, will be refreshed every 10 minutes using the data supplied by the iswu0c0-0 switch.
Note: For InfiniBand clusters that include multiple managed switches, the user needs to know which switch is running the subnet manager as master. This switch should always be the one that is specified as the argument of the -s flag. If the data is refreshed by the cron daemon and another switch becomes the subnet manager master, the details contained in the database will be incorrect, because the cron script will still be using data from what is now the slave switch.
Use the sminfo command as follows to find out which subnet manager is running as the master. Output in a form similar to that below will be provided:
    sminfo
    sminfo: sm lid 1 sm guid 0x8f1040041254a, activity count 544113 priority 3 state 3 SMINFO_MASTER
The guid that is identified can then be used to find the corresponding switch name in the ibsdb chassis table.
2.4.2.3 dbupdatepc
Use the dbupdatepc action to
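The sminfo line shown above can be parsed so that a script can confirm which subnet manager is the master before the database is refreshed. The Python sketch below is a minimal illustration. The exact punctuation of the output is assumed from the example, and the function name `parse_sminfo` is hypothetical.

```python
import re

# Pattern modelled on the sample sminfo line; the comma after the guid is optional.
SMINFO_RE = re.compile(
    r"sm lid (\d+)\s+sm guid (0x[0-9a-fA-F]+),?\s+activity count (\d+)\s+"
    r"priority (\d+)\s+state (\d+)\s+(\w+)")

def parse_sminfo(line):
    """Return the fields of an sminfo output line, or None if it does not match."""
    m = SMINFO_RE.search(line)
    if m is None:
        return None
    return {"lid": int(m.group(1)), "guid": m.group(2),
            "activity": int(m.group(3)), "priority": int(m.group(4)),
            "state": int(m.group(5)), "state_name": m.group(6)}

info = parse_sminfo("sminfo: sm lid 1 sm guid 0x8f1040041254a, "
                    "activity count 544113 priority 3 state 3 SMINFO_MASTER")
is_master = info is not None and info["state_name"] == "SMINFO_MASTER"
print(info["guid"], is_master)
```

The extracted guid is the value that would then be looked up in the chassis table to find the switch name.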
75. enIB Diagnostic Tools Software Stack
Example BIOS parameter setting screen for NovaScale R421
Example BIOS parameter setting screen for NovaScale R422
List of Tables
Table 2-1. Maintenance Tools
Table 3-1. Available troubleshooting options for storage commands
Chapter 1. Stopping/Starting Procedures
This chapter describes procedures for stopping and restarting Bull HPC cluster components, which are mainly used for maintenance purposes. The following procedures are described:
• 1.1 Stopping/Restarting a Node
• 1.2 Stopping/Restarting an Ethernet Switch
• 1.3 Stopping/Restarting a Backbone Switch
• 1.4 Stopping/Restarting the HPC Cluster
1.1 Stopping/Restarting a Node
Stopping a Node
Follow these steps to stop a node:
1. Stop the customer's environment. Check that the node is not running any applications by using the SINFO command on the management node. All customer applications and connections should be stopped or closed, including shells and mount points.
2. Unmount the filesystem.
3. Stop the node. From the management node, enter: nsctrl poweroff <node_name
76. ered as failed. If the number of retries does not exceed the retry value, the command is launched again; otherwise it fails.
cmdtimeout 300
See the Bull HPC BAS5 for Xeon Administrator's Guide for more details about the nec_admin command.
3.5 Lustre Troubleshooting
The following section helps you troubleshoot some of the problems affecting your Lustre file system. Because typographic errors in your configuration script or your shell script can cause many kinds of errors, check these files first when something goes wrong. First, be sure that your file system is mounted and that you have the mandatory user rights.
3.5.1 Hung Nodes
There is no way to clear a hung node except by rebooting. If possible, unmount the clients, shut down the MDS and OSTs, and shut down the system.
3.5.2 Suspected File System Bug
If you have rebooted the system repeatedly without following complete shutdown procedures, and Lustre appears to be entering recovery mode when you do not expect it, take the following actions to shut down your system cleanly:
1. Stop the login nodes and all other Lustre client nodes. Include the -F option with the lustre_util command to unmount the file system:
lustre_util umount -F -f <file_system> -n <node_name>
2. Shut down the rest of the system.
3. Run the e2fsck command.
3.5.3 Cannot re-install a Lustre File System if the status is CRITICAL
If the st
77. es Name of the corresponding conf file
fcswregister: Yes
iorefmgmt: Yes
ioshowall: Yes
lsiocfg: Yes, Yes
lsiodev: Yes
nec_admin: Yes, Yes, nec_admin.conf
nec_stat: Yes
stordepha: Yes
storcheck: Yes, Yes, storframework.conf
stordepmap: Yes, Yes
stordiskname: Yes
storiocellctl: Yes, Yes, storframework.conf
storioha: Yes
Command / User option / Log Traces / Name of the corresponding conf file
storiopathctl: Yes, Yes, storframework.conf
stormap: Yes, Yes
stormodelctl: Yes, Yes, storframework.conf
storregister: Yes, Yes, storframework.conf
storstat: Yes, Yes, storframework.conf
stortrapd: No, Yes, storframework.conf
stortraps: No, Yes, storframework.conf
Table 3-1. Available troubleshooting options for storage commands
3.4.1.4 nec_admin Command for Bull FDA Storage Systems
The nec_admin command is used to manage Bull FDA Storage Systems. This command interacts with the FDA CLI. A retry mechanism has been implemented to handle the fact that the CLI may reject commands when overloaded. If, despite the default settings, the nec_admin command occasionally fails, you may change the timeout and retry values defined in the /etc/storageadmin/nec_admin.conf file:
Number of retries in case of iSMserver Busy (Not Mandatory): retry 3
If retry is set, time in seconds between two retries (Not Mandatory): rtime 5
Timeout value; when the timeout is reached, the command is consid
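The retry behaviour described for nec_admin (a bounded number of retries, with a pause between them) can be illustrated with a short Python sketch. It is not the nec_admin implementation: run_with_retry and its arguments are hypothetical, the defaults simply mirror the retry 3 and rtime 5 values quoted from nec_admin.conf, and the command timeout is left out for brevity.

```python
import time


def run_with_retry(command, retry=3, rtime=5, sleep=time.sleep):
    """Call command() until it succeeds or the retry budget is exhausted.

    command: zero-argument callable returning True on success and False when
             the request is rejected (hypothetical stand-in for one CLI call).
    retry:   number of additional attempts allowed after the first failure.
    rtime:   pause in seconds between two attempts.
    Returns (succeeded, number_of_attempts).
    """
    attempts = 0
    while True:
        attempts += 1
        if command():
            return True, attempts
        if attempts > retry:
            # Retry budget used up: the command is reported as failed.
            return False, attempts
        sleep(rtime)


# Example: a command that is rejected twice, then accepted.
answers = iter([False, False, True])
print(run_with_retry(lambda: next(answers), sleep=lambda seconds: None))
```

Passing the sleep function as a parameter keeps the sketch easy to exercise without waiting; a real wrapper would keep the default.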
78. figuration on a local NovaScale R42x machine (channel 1), run the command below:
ipmitool lan print 1
2. To obtain the BMC LAN configuration on a local NovaScale R440 or R460 machine (channel 2), run the command below:
ipmitool lan print 2
4.1.2 Remote access to the BMC
4.1.2.1 Command Line Remote access
The BMC of a remote node can be accessed using the ipmitool command (see man ipmitool) or the higher-level, cluster-oriented conman or NS commands. See Chapter 2 in this manual.
Examples using the ipmitool command:
1. To obtain the BMC LAN configuration for a NovaScale R42x machine (channel 1):
ipmitool -H <BMC IP addr> -U ADMIN -P ADMIN lan print 1
2. To shut down a remote machine:
ipmitool -H <BMC IP addr> -U ADMIN -P ADMIN power soft
3. To connect to a remote console via SOL for NovaScale R421, R422, R422 E1, R423, R440 and R460 machines:
ipmitool -I lanplus -H <BMC IP addr> -U ADMIN -P ADMIN sol activate
Enter ~. to terminate the connection.
4. To connect to a remote console via SOL for a NovaScale R421 E1 machine:
ipmitool -I lanplus -H <BMC IP addr> -U ADMIN -P ADMIN -o intelplus sol activate
4.1.2.2 Tips for using ipmitool and SOL
• If the payload is already active for another session, it can be deactivated by running the ipmitool sol deactivate command.
• The es
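When the remote ipmitool commands above are scripted, it can help to build the argument list in one place. The following Python sketch only assembles the argument vector and does not contact any BMC; ipmitool_argv is a hypothetical helper name, and the address 10.0.0.5 is a made-up placeholder.

```python
def ipmitool_argv(host, user, password, command, interface=None, oem=None):
    """Build an ipmitool argument list for a remote BMC (hypothetical helper).

    The option order follows the examples in the text: optional -I interface,
    then -H/-U/-P, then the optional -o OEM type, then the subcommand words.
    """
    argv = ["ipmitool"]
    if interface:
        argv += ["-I", interface]
    argv += ["-H", host, "-U", user, "-P", password]
    if oem:
        argv += ["-o", oem]
    argv += list(command)
    return argv


# LAN configuration of channel 1 on a remote BMC (placeholder address).
print(ipmitool_argv("10.0.0.5", "ADMIN", "ADMIN", ["lan", "print", "1"]))

# SOL console with the lanplus interface and the intelplus OEM type.
print(ipmitool_argv("10.0.0.5", "ADMIN", "ADMIN", ["sol", "activate"],
                    interface="lanplus", oem="intelplus"))
```

The resulting list can be handed to subprocess.run as-is, which avoids shell quoting issues with passwords or addresses.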
79. formation to the Management Node using syslog-ng.
postbootchecker Checks for the Management Node
On the Management Node, the postbootchecker server gets the information returned from the Compute Nodes and compares it with the information stored in the ClusterDB:
• The number of CPUs available on the node is compared with the nb_cpu_total value in the ClusterDB.
• The size of the memory available on the node is compared with the memory_size value in the ClusterDB.
• The CPU model type on the node is compared with the cpu_model value in the ClusterDB.
If discrepancies are found, the ClusterDB is updated with the values retrieved. In addition, the Nagios status of the postbootchecker service is updated as follows:
• If the discrepancies concern the number of CPUs or the memory size, the service is set to CRITICAL.
• If the discrepancies concern the model of the CPUs, the service is set to WARNING.
If no discrepancies are found, the service is OK.
Chapter 3. Troubleshooting
Troubleshooting deals with the unexpected and is an important contribution towards maintaining a cluster in a stable and reliable condition. This chapter is aimed at helping you to develop a general, comprehensive methodology for identifying and solving problems on and off site. The following topics are described:
• 3.1 Troubleshooting Voltaire Networks
• 3.2 Troubleshooting InfiniBand Stacks
• 3.3 Node Deployment Troubleshooting
• 3.4 Stor
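The postbootchecker comparison rules described above (CPU count or memory size mismatch gives CRITICAL, CPU model mismatch gives WARNING, otherwise OK) can be written as a small decision function. This is an illustrative Python sketch, not the postbootchecker code: the function name and the dictionary layout are assumptions, and giving CRITICAL precedence when several fields differ is also an assumption, since the text does not say which status wins.

```python
def postbootchecker_status(node, clusterdb):
    """Map discrepancies between node values and ClusterDB values to a status.

    node, clusterdb: dicts with the keys nb_cpu_total, memory_size and
    cpu_model (key names taken from the ClusterDB fields named in the text).
    """
    if (node["nb_cpu_total"] != clusterdb["nb_cpu_total"]
            or node["memory_size"] != clusterdb["memory_size"]):
        # Assumption: CRITICAL takes precedence over a CPU model mismatch.
        return "CRITICAL"
    if node["cpu_model"] != clusterdb["cpu_model"]:
        return "WARNING"
    return "OK"


# Made-up example values for illustration only.
reference = {"nb_cpu_total": 8, "memory_size": 16384, "cpu_model": "Xeon"}
reported = dict(reference, memory_size=8192)
print(postbootchecker_status(reported, reference))
```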
80. gh to store the contents of a CD-ROM/DVD-ROM.
TMP_DIR: Path of the temporary directory used by mkCDrec.
DVD_DRIVE: 1 or 0. Set 0 to create CD-ROM backups or 1 to create DVD backups.
MAXCDSIZE: Maximum size of the created images in KB. Example: 4200000 for a DVD-ROM, 620000 for a CD-ROM.
CDREC_ISO_DIR: Path of the directory used to store the ISO backups. Ensure that this directory is large enough to store all the backups.
EXCLUDE_LIST: List of the directories and files to be saved in the backup. Choose only what seems important to save in order to obtain a backup of a reasonable size.
BOOTARCH: Defines the architecture of the system to back up (x86, ia64, etc.). Check that the value fits the system.
The configuration can be performed using the Webmin interface (http://hostname:10000/mkcdrec).
2.3.2 Creating a Backup
Perform these operations on the Management Node:
1. Log on as the root user in single mode.
2. Stop the activity on the Management Node; the ClusterDB must not be used during the backup operation.
3. Go to the mkCDrec base directory (by default, /var/opt/mkcdrec):
cd /var/opt/mkcdrec
4. Check that the system is operational for mkCDrec:
make test
mkCDrec displays warning messages if it detects that some elements are missing for the backup. If this happens, make the appropriate corrections and rerun make test until the test is successful.
5. Launch the backup operation:
make
A menu is dis
81. gs
ksis server logs are saved on the Management Node in /var/lib/systemimager/overrides/ka-d-server.log, and ksis server traces are saved on the Management Node in /var/lib/systemimager/overrides/server_log.
Note: Traces are only possible for the ksis server and for client nodes if the ksis deploy command is executed using the -g option.
ksis image client logs
ksis client logs are saved on the Management Node in /var/lib/systemimager/overrides/imaging_complete_<nodeIP>, /var/lib/systemimager/overrides/patching_complete_<nodeIP> or /var/lib/systemimager/overrides/unpatching_complete_<nodeIP>.
ksis client traces are saved on the Management Node in /var/lib/systemimager/overrides/imaging_complete_error_<nodeIP>. These traces are only logged if the deployment error occurs on the client side.
Patch deployment client traces are saved on the Management Node in /var/lib/systemimager/overrides/patching_complete_error_<nodeIP> or /var/lib/systemimager/overrides/unpatching_complete_error_<nodeIP>.
The client log files are used during the post-check phase. Ksis client and image server errors are compared in order to identify the source of any problems that may occur. The trace files are kept for support operations.
3.4 Storage Troubleshooting
This section provides some tips to help the administrator troubleshoot a storage configuration.
3.4.1 Management Tools Troubleshooting
3.4.1.1
82. he two lustre_HA services In the following example nova and nova10 are paired I O nodes The lust re_nova service is started on nova10 owner node This status is consistent on both nova and noval0 nodes Troubleshooting 3 19 3 20 lustre_migrate hastat n nova 6 10 noval0 Member Status Quorate Group Member Member Name State nova6b Online noval0 Online Service Name Owner Last lustre_noval0 noval0 lustre_nova6b noval0 nova6b Member Status Member Name Quorat lustre_noval0 lustre_nova6b e Group Member State Online Online Owner Last noval0 noval0 ID 0x0000000000000001 0x0000000000000002 State started started ID 0x0000000000000002 0x0000000000000001 State started started To return to the initial configuration you should stop lust re_nova which is running on nova10 and start it on nova6 using the lustre_migrate relocate command lustre_util status This command displays the current state of the Lustre file systems A Important Sometimes this command can simply indicate that the recovery phase has not finished in this situation the status will be set to WARNING and the remaining time will be displayed A Important When an I O node have been completely re installed following a system crash the Lustre configuration parameters will have been lost for the node They need to be redeployed from the Management Node by the system administrator This is
83. he update bmc fw lt fw version gt rpm onto the local machine 2 Run the command below updatefw x86_64 i IP Address u ADMIN p ADMIN f usr local firmware lt firmware gt bin Where lt firmware gt is ubsim lt BMC FW version gt for a SIMSO board ugsim lt BMC FW version gt for a SIMSO with KVM board 4 4 BASS for Xeon Maintenance Guide 3 To initialize the SDR on the remote machine sdrload usr local firmware lt platform gt sdr dat lt BMC IP Address gt where lt platform gt equals either r421 r422 for NovaScale R422 and R422 El machines or R423 Usage updatefw x86_64 f Firmware File updatefw x86_64 i IP Address u Usr p Pwd f Firmware File sdrload lt SDR file gt lt bmc ipaddr gt lt user name gt lt user passwd gt SDR file SDR file provided by sdredit command bmc ipaddr The BMC address of remote machine If no address is provided the local SDR repository is updated user name BMC user name user passwd BMC user password To update the BMC firmware using the Web interface tl See the Bull NovaScale R42x AOC SIMSO SIMSO Installation and User s Guide for more information Accessing Updating and Reconfiguring the BMC Firmware on NovaScale R4xx machines 4 5 4 3 Updating the BMC firmware on NovaScale R440 and R460 machines The BMC update for these platforms is carried out using the Bull Update BIOS CD which is also used to upgrade the BIOS and FRUs and is
84. ified ports, enter: ibstat mthca1 2
• To list the port guids of mthca0, enter: ibstat -p mthca0
• To list all CA names, enter: ibstat -l
2.4.2 Diagnosing InfiniBand Fabric Problems (IBS tool)
This tool is used from the Management Node to diagnose problems in the InfiniBand fabric, using the cluster switch topology information contained in the NetworkMap.xml file and the error-checking counters contained in the PortCounters.csv file. Alternatively, an IBS database (IBSDB) containing all the switch information can be created and then used as the data source to diagnose the problems.
Command syntax
ibs -a <action> -hvCNE -s <switch> -f <networkmap> <counters>
The following options are available for the ibs command:
-h Help file
-v Verbose mode
-C Disable colored text output
-a Action, one of: topo, bandwidth, errors, config, group, dbpopulate, availability, dbcreate, dbdelete, dbupdate, dbupdatepc
OFED-related options
When working from the cluster Management Node, and provided this node is fitted with an InfiniBand adapter that is connected to an InfiniBand interconnect, it is recommended that the -N and -E options are used, as the OFED software view of the cluster is more reliable than that provided by data taken directly from the switch.
-N Query the IB subnet manager to obtain and update the hostname details
-E Que
85. ile system if the kernel permits it.
• To increase or decrease the partition size with the help of the mkCDrec utilities.
Note: mkCDrec is designed for system backups. It is not the objective of mkCDrec to back up all system data, and it is recommended that you regularly back up all your data using another method.
A typical usage example is to run mkCDrec every night for a system and store the ISO images on another system via NFS. In case of a problem, it is then possible to burn the saved image onto a CD-ROM/DVD-ROM and restore the system.
What follows is an overview of configuring and using mkCDrec. For more information, please refer to http://mkcdrec.sourceforge.net
2.3.1 Configuring mkCDrec
The /var/opt/mkcdrec/Config.sh file contains the configuration parameters for mkCDrec. All parameters have a default value. However, it is recommended that the following values are checked, either to verify that they fit your needs or to define your own values, in order to generate a coherent but not too large system backup.
BURNCDR: Y or N. Y means that the CD-ROM/DVD-ROM will be burned directly from the machine. N means that ISO images of the CD-ROM/DVD-ROM will be created.
ISOFS_DIR: Path of the temporary directory used before creating the ISO images. Ensure that this directory is large enou
86. ine
• To know more about the ipmitool command, enter: ipmitool -h
2.2.2 Stopping/Starting the Cluster (nsclusterstop / nsclusterstart)
The nsclusterstop / nsclusterstart scripts are used to stop or start the whole HPC cluster. These scripts launch the various stages in sequence, making it possible to stop or start the cluster in full safety. For example, the stop process includes the following main steps:
• checking the various equipment
• stopping the file systems (Lustre, for example)
• stopping the storage devices
• stopping the nodes, except the Management Node(s)
nsclusterstop and nsclusterstart use two configuration files, /etc/clustmngt/nsclusterstart.conf and /etc/clustmngt/nsclusterstop.conf, whose values can be changed. The --file option allows you to specify another configuration file. These files define:
• the delay parameters between the different stages required to stop or start the cluster
• the sequence in which the groups of nodes should be stopped or started. You can run dbmGroup show to display the configured groups.
Usage
/usr/sbin/nsclusterstop [-h] [-f | --file <filename>]
/usr/sbin/nsclusterstart [-h] [-f | --file <filename>]
Options
--file <filename>, -f Specify a configuration file (default: /etc/clustmngt/nsclusterstart.conf or /etc/clustmngt/nsclusterstop.conf)
-h Display nsclustersta
87. ing Disabled Enhanced x8 Detection Enabled High Bandwidth FSB Enabled High Temp DRAM OP Disabled AMB Thermal Sensor Disabled Thermal Throttle Disabled Global Activation Throttle Disabled 7 10 BASS for Xeon Maintenance Guide BIOS setup section parameter value Crystal Beach Feature Enabled Route Port 80h cycles to LPC Clock Spectrum Feature Disabled High Precision Event Timer No USB Function Enabled Legacy USB Support Enabled Advanced Processor Options Frequency Ratio Default Core Multi Processing Enabled Machine Checking Enabled Thermal Management 2 Enabled C1 Enhanced Mode Disabled Execute Disable Bit Enabled Adjacent Cache Line Prefetch Enabled Hardware Prefetcher Enabled Direct Cache Access Disabled Intel R Virtualization Technology Disabled Intel EIST support Disabled I O Device Configuration Serial port A Enabled Base I O address Serial port A 3F8 Interrupt Serial port A IRQ 4 Serial port B Enabled Mode Normal Base I O address Serial port B 2F8 Interrupt Serial port B IRQ 3 DMI Event Logging Event Logging ECC Event Logging Console Redirection Com Port Address Baud Rate Console Type Flow Control Console connection Direct Continue C R after POST Hardware Monitor CPU Temperature Threshold 750C Fan Speed Control Modes 2 3 pin Server IPMI System Event Logging Enabled Clear System Event Log Disabled SYS Firmware Progress Disab
88. ink. See 2.2.1.2 Using ipmi Tools.
Note: Storage Units may also provide console interfaces through serial ports, allowing configuration and diagnostics operations.
Using ConMan
The ConMan command allows the administrator to manage all the consoles, including server consoles and storage subsystem consoles, on all the nodes. It maintains a connection with all the lines that it administers. It provides access to the consoles using a logical name. It supports the key sequences that provide access to debuggers or to dump captures (Crash/Dump).
ConMan is installed on the Management Node.
The advantages of ConMan over a simple telnet connection are as follows:
• Symbolic names are mapped per physical serial line.
• There is a log file for each machine.
• It is possible to join a console session or to take it over.
• There are three modes for accessing the console: monitor (read-only), interactive (read-write), broadcast (write-only).
Syntax
conman <OPTIONS> <CONSOLES>
-b Broadcast to multiple consoles (write-only)
-d HOST Specify server destination [127.0.0.1:7890]
-e CHAR Specify escape character [&]
-f Force connection (console-stealing)
-F FILE Read console names from file
-h Display this help
-j Join connection (console-sharing)
-l FILE Log connection output to file
-L Display license information
-m Monitor connection (read-only)
-q Query serve
89. irmware spine 2 path to firmware
Note: Whenever a line board or a fabric board is replaced, always ensure that it is using the correct firmware.
3. Check that the firmware has been upgraded correctly by running the firmware_verify_anafa_II command:
switchname utilities firmware_verify_anafa_II
Chapter 6. Updating the firmware for the MegaRAID card
The MegaRAID SAS driver for the 8408E card is included in the BAS5 for Xeon delivery. The MegaRAID card will be detected, and its driver installed automatically, during the installation of the BAS5 for Xeon software suite.
The MegaCLI tool is used to update the firmware for the MegaRAID card and is available on the Bull support CD. The latest firmware file should be downloaded from the LSI web site.
Follow the procedure described below to update the firmware:
1. Check the version of the firmware already installed by running the command:
/opt/MegaCli -AdpAllInfo -a0
This will provide full version and manufacturing date details for the firmware, as shown in the example below:
Adapter #0
Versions
Product Name: MegaRAID SAS 8408E
Serial No: PO88043006
FW Package Build: 5.0.1-0053
Mfg. Data
Mfg. Date: 01/16/07
Rework Date: 00/00/00
Revision No:
Image Versions In Flash
Boot Block Version
90. itches which use 4.0 or later firmware versions.
ibs -s <switch_name> -vNE -a config
group
This action generates the group.csv file that includes the hostname mapping configuration details for all the switches; this file can then be imported into a switch in order to configure it. For large clusters, this is quicker than running the config action, as detailed above, to generate and import the cluster switch configuration details into a switch.
Note: This option only applies to Voltaire switches which use version 4.0 or later firmware.
ibs -s iswu0c0-0 -a group
While the command is being carried out, a message similar to that below will appear:
Successfully generated configuration file group.csv
To update a managed switch, proceed as follows:
• Log onto the switch
• Enter the enable mode
• Enter the config menu
• Enter the group menu
• Type the following command: group import /home/user/path
2.4.2.2 IBSDB Database
It is possible to create a database which includes all the hardware and InfiniBand traffic details for all the switches with the IBS tool. This database is specific to InfiniBand hardware. The following commands apply to the IBSDB Database.
dbcreate
To create an empty new IBS database (ibsdb), use the dbcreate command. Only the postgres user is allowed to create an empty database.
postgres@admin$ ibs -a dbcreate
While the command
91. ive width 4X rate 5.0
ISR9024D Voltaire lid 0x2 port 6 guid 0008f10400411d6a state Active width 4X rate 5.0
2.5.2.2 ibtracert Command
ibtracert uses Subnet Manager Protocols (SMP) to trace the path from a source GID/LID to a destination GID/LID. Each hop along the path is displayed until the destination is reached or a hop does not respond. By using the mg and/or ml options, multicast path tracing can be performed between the source and destination nodes.
Syntax
ibtracert [options] <src addr> <dest addr>
Flags
-n Simple format; no additional information is displayed
-m <mlid> Show the multicast trace of the specified mlid
Examples
• To show the trace between lid 2 and lid 23, enter: ibtracert 2 23
• To show the multicast trace between lid 3 and lid 5 for mcast lid 0xc000, enter: ibtracert -m 0xc000 3 5
Output
The output for a command between two points is displayed in both hexadecimal format and human-readable format, as shown in the example below for the trace between the two lids 0x22 and 0x2c. This is very useful in helping to identify any port or switch problems in the InfiniBand fabric.
ibtracert 0x22 0x2c
From ca 0008 10403979958 portnum 1 lid 0x22 0x22 lynx13 HCA 1
1 > switch po
92. (switch listing: one entry per switch giving its lid, guid and "ports 24")
error find script
The easiest way to look for errors on all ports in the fabric is to run the error find script. It will report any non-zero port counters found throughout the fabric, on both switches and HCAs.
ISR9288 utilities error find
Show All Counter Errors (every error found will be printed)
lid 1 guid 0008f104004004d7 ports 24
lid 5 guid 0008f104003f0723 ports 24 (followed by per-port counter lines, including a linkdowned count)
lid 4 guid 0008f104003f0722 ports 24 (followed by per-port counter lines)
3.1.6 Event Notification Mechanism
Fabric-related events can be generated by both the PM (Performance Monitor) and the SM (Subnet Manager). The PM periodically scans the error counters of all IB elements in the fabric and reports if a counter exceeds its threshold. The SM monitors the fabric, detects configuration changes, and dynamically configures the new elements and new routes in the fabric. The SM can detect fabric
94. l condition Single bit AGB PCI Hole Granularity 256 MB Memory Branch Mode Interleave Branch O Rank Interleave 4 l Branch O Rank Sparing Disabled Branch 1 Rank Interleave 4 l Branch 1 Rank Sparing Disabled Enhanced x8 Detection Enabled High Bandwidth FSB Enabled High Temp DRAM OP Disabled AMB Thermal Sensor Disabled Thermal Throttle Disabled Global Activation Throttle Disabled Crystal Beach Feature Enabled Route Port 80h cycles to LPC Clock Spectrum Feature Disabled High Precision Event Timer No USB Function Enabled Legacy USB Support Enabled Advanced Processor Options Frequency Ratio Default Core Multi Processing Enabled Machine Checking Enabled Thermal Management 2 Enabled C1 Enhanced Mode Disabled Execute Disable Bit Enabled Adjacent Cache Line Prefetch Enabled Hardware Prefetcher Enabled Direct Cache Access Disabled Intel R Virtualization Technology Disabled Intel EIST support Disabled 1 O Device Configuration KBC Clock Input 12MHz Serial port A Enabled Base I O address Serial port A 3F8 Interrupt Serial port A IRQ 4 Serial port B Enabled Mode Normal Base I O address Serial port B 2F8 Interrupt Serial port B IRQ 3 Floppy disk controller Enabled Base I O address Primary DMI Event Logging Event Logging Enabled ECC Event Logging Enabled Console Redirection 7 6 Com Port Address Baud Rate Console Type BASS for Xeon Maintenance Guide AN OaKR WBN
95. l the required stages.
1. From the management node, run: nsclusterstop
2. Stop the management node.
1.4.2 Starting the HPC Cluster
To start the whole cluster in complete safety, it is necessary to launch the different stages in sequence. The nsclusterstart script includes all the required stages.
1. Start the Management Node.
2. From the Management Node, run: nsclusterstart
See Chapter 2 for details of the nsclusterstop / nsclusterstart commands and their associated configuration files.
Chapter 2. Day to Day Maintenance Operations
2.1 Maintenance Tools Overview
This chapter describes a set of maintenance tools provided with a Bull HPC cluster. These tools are mainly Open Source software applications that have been optimized, in terms of CPU consumption and data exchange overhead, to increase their effectiveness on Bull HPC clusters, which may include hundreds of nodes.
The tools are usually available through a browser interface or through a remote command mode. Access requires specific user rights and is based on secured shells and connections.
Function: Administration, Backup/Restore, Monitoring, Debugging, Testing
Tool: ConMan, ipmitool. Purpose: Managing consoles through a serial connection.
Tool: nsclusterstop / nsclusterstart. Purpose: Stopping/starting the cluster.
Tool: nsctrl, Remote Hardware Management CLI. Purpose: Managing hardware (power on, power off, reset, sta
96. led BIOS POST Errors Enabled BIOS POST Watchdog Disabled OS boot Watchdog Disabled Timer for loading OS min 10 Time out action No Action Security Supervisor Password Is Clear User Password Is Clear Password on boot Disabled Boot 1 O A 0 N Managing the BIOS on NovaScale R4xxx Machines 7 11 BIOS setup section parameter 7 12 BAS5 for Xeon Maintenance Guide VK ES NovaScale R422 El BIOS Settings motherboard X7DWT R422 El BIOS 1 0b 7DWTC217 BIOS setup section parameter value Main System Time System Date Serial ATA Enabled Native Mode Operation Serial ATA SATA Controller Mode Option Compatible Advanced Boot Features QuickBoot Mode QuietBoot Mode POST Errors ACPI Mode Yes Power Button Behaviour Instant Off Resume On Modem Ring Off EFI OS Boot Disabled Power Loss Control Last State Watch Dog Disabled Summary screen Disabled o Memory Cache Cache System BIOS area Write Protect Cache Video BIOS area Write Protect Cache Base 0 512k Write Back Cache Base 512k 640k Write Back Cache Extended Memory Area Write Back Discrete MTRR Allocation Disabled PCI Configuration Onboard G LAN1 OPROM Configure Onboard G LAN2 OPROM Configure Disabled Option ROM Re Placement Disabled PCI Parity Error Forwarding Disabled PCI Fast Delayed Transaction Disabled Reset Configuration Data No SLOT PCl Exp x16 Option ROM Scan Enabled Enable Master Enabled Latency
97. leshooting Voltaire Networks ani 3 1 3 1 1 Voltaire s Fabric Manager aio 3 1 3 1 2 Fabrie Diagnostics ida 3 2 Table of Contents iii SS Debugging TOO 8 id 3 2 3 1 4 High level Diagnosti TOO iaa ir 3 2 3 1 5 Gl Diagnosis dd A dead woe pment ols 3 3 3 1 6 Event Notification Mechanics tai 3 4 3 2 Troubleshooting InfiniBand Stacks A 3 5 3 2 1 SMPQUETY oi A 3 5 3 22 o yrrir i bias tary tacit aetenbe antennas tana etyaa A 3 8 3 2 3 ibnetdiscover and ibehecknetic A yds dade eee del SA ees 3 9 3 2 4 ibeheckwidth and ibcheckportwidthis lt csccs sy5cds seas adds 3 10 JZS More A A ns ial cared aes aad Atak owe al ats 3 11 3 3 Node Deployment Troubleshooting si rada 3 12 3 3 1 sde plo MENTA gs sesers di 3 12 3 3 2 Possible Deployment Problems iii da 3 12 3 4 Storage MD o O 3 14 3 4 1 Management Tools Troubleshooting e 3 14 3 5 Lustre Troubleshooting a a IES 3 17 A O kenaaees vacates 3 17 3 0 2 Suspected File System oO 3 17 3 5 3 Cannot re install a Lustre File System if the status is CRITICAL ooooooooooccnnoccinnocccionos 3 17 3 6 Lustre File System High Availability Troubleshooting ooooooocccinooccnnocccionocccoonnconanccconnccns 3 19 3 6 1 On the Management ds a REO 3 19 3 6 2 Onthe Nodes of an I O Pair cccccccceeecccccececceccececceccusecceccusececccusececeeutereeceaaereeees 3 21 3 7 AUR Troubleshooting essea i a en aa T E A a E E 3 24 3 7 1 O E A A EA 3 24 3 7 2 SLURM is NTE A oct Se a ae Sao ae ees 3 24 3 7 3 J
98. ltaire, ISR9024D Voltaire, bali23 HCA 1
ibdoctor is a Bull tool which calls on the ibtracert, ibnetdiscover and smpquery diagnostic tools whilst at the same time interfacing with the ClusterDB database, so that any problems in the InfiniBand network can be identified easily.
ibdoctor Command
ibdoctor may be used:
• to identify where any problem adapters or nodes are located
• to display communication paths, including bandwidth, between ports in a human-readable format
Options
-s <src_lid> Use specified source lid
-d <dst_lid> Use specified destination lid
-t Trace route between <src_lid> and <dst_lid>
-T Report the fabric state over all known routes
-h Help
Example
• To display status data for the path between two InfiniBand adapters with the local identifiers 0x14 and 0x1e, enter:
ibdoctor -t -s 0x14 -d 0x1e
The output looks as follows:
RACK2 M lid 0x14 port 1 guid 0002c90200234144 state Active width 4X rate 5.0 Gbps
lid 0x11 port 2 guid 0008f10400411da2 state Active width 4X rate 5.0 Gbps
lid 0x11 port 12 guid 0008f10400411da2 state Active width 4X rate 5.0 Gbps
RACK2 K lid 0x1e port 1 guid 0002c902002341b1 state Active width 4X rate 5.0 Gbps
• The -T option completes an exhaustive scan of the network, and traces and checks all the possible routes between the adapters:
ibdoctor -T
The output looks as follows:
99. m Protocol
USB  Universal Serial Bus

W
WWPN  World Wide Port Name

Index

/etc/clustmngt/nsclusterstart.conf file, 2-5
/etc/clustmngt/nsclusterstop.conf file, 2-5
/proc file, 2-40

B
backing up system. See mkCDrec
BIOS parameters settings, 7-4
BIOS update, 7-1
BMC firmware update, 4-1
bootable system image. See mkCDrec

C
CD-ROM backup. See mkCDrec
CLI Remote Hardware Management, 2-8
clone-dsk.sh script, 2-15
clustat command, 3-21
ClusterDB CPU and memory values, 2-42
Commands: clone-disk.sh, 2-15; clustat, 3-21; conman, 2-2; crash, 2-40; dbmConfig, 1-3; e2fsck, 3-17; fsck, 2-15; ibchecknet, 3-9; ibcheckportwidth, 3-10; ibcheckwidth, 3-10; ibdoctor, 2-37; ibnetdiscover, 3-9; ibstat, 2-18; ibstatus, 2-18; ibtracert, 2-38; ioshowall, 3-21; ipmitool, 2-4; lctl, 3-22; lmdiag, 3-28; lpflash, 2-13; lputil, 2-13; lsiocfg, 2-33; lustre_check, 3-19; lustre_migrate hastat, 3-19; lustre_migrate nodestat, 3-19; lustre_util, 3-20; nec_admin, 3-16; nsctrl, 1-1; openib, 3-5; perfquery, 3-8; postbootchecker, 2-42; restore-fs.sh, 2-15; SINFO, 1-1; smpquery, 3-5; start-restore.sh, 2-15, 2-17; storioha, 3-21; stormap, 3-22; switchname, 2-30; ulimit, 2-37
ConMan, using, 2-2
conman.conf file, 2-3
Core Dump Size, modifying, 2-37
cpu_model, 2-42
crash, 2-40

D
dbmConfig command, 1-3
Debugging tools, 2-37
Dump Size, modifying, 2-37

E
e2fsck command, 3-17
Emulex FC adapter, 2-33
Emulex HBA firmware
100. multaneous nsmpower processes. Default: 30.

    --only_test, -o   Display the NS Commands that would be launched according to the specified options and action. This is a testing mode; no action is performed.
    --time            Time to wait after the number of nsm calls defined by the interval option.
    --verbose, -v     Verbose mode.

Parameters
    --Type <device type>   Type of devices to be pinged: disk_array or server.
    command                on or off.
    devices                Specify the name of the devices using the basename[i-k]-like syntax.

Examples
• The following command verifies that all the power supplies for disk_array 10 to 15 are in the on state, and indicates those which are not:

    pingcheck --Type disk_array on da[10-15]

• The following command verifies that servers nova5 to nova7 are in the off state, and indicates those which are not:

    pingcheck --Type server off nova[5-7]

2.5 Debugging Maintenance Tools

2.5.1 Modifying the Core Dump Size

By default, the maximum size for core dump files on Bull HPC systems is set to 0, which means that no resources are available and core dumps cannot be produced. To allow core dumps, the values for the ulimit command have to be changed. For more information, refer to the options for the ulimit command in the bash man page.

2.5.2 Identifying InfiniBand Network Problems (ibdoctor, ibtracert)

2.5.2.1
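The core dump limit described above can be inspected and raised from the shell. This is a minimal sketch using the standard `ulimit -c` builtin; `core_dump_status` is a hypothetical helper name, and whether the limit can be raised depends on the hard limit configured on the system.

```shell
#!/bin/sh
# Hypothetical helper: report whether core dumps are possible in this shell.
# "ulimit -c" prints the current soft limit for core file size.
core_dump_status() {
    limit=$(ulimit -c)
    if [ "$limit" = "0" ]; then
        echo "disabled"
    else
        echo "enabled ($limit)"
    fi
}

# Typical use before reproducing a crash (applies to this shell and its children):
#   ulimit -c unlimited
#   core_dump_status
```

The change made with `ulimit -c` only lasts for the current shell session, so it has to be repeated, or set in the shell start-up files, for later sessions.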
101. n Ethernet switch must be replaced, the MAC address of the new switch must be set in the ClusterDB. This is done as follows:

1. Obtain the MAC address for the switch (generally written on the switch, or found by looking at the DHCP logs).
2. Use the phpPgAdmin Web interface of the database to update the switch MAC address (http://<IP address of the management node>/phpPgAdmin/, user clusterdb and password clusterdb).
3. In the eth_switch table, look for the admin_macaddr field in the row corresponding to the name of your switch. Edit and update this MAC address. Save your changes.
4. Run a dbmConfig command from the management node:

    dbmConfig configure --service sysdhcpd --force --nodeps

5. Power off the Ethernet switch.
6. Power on the Ethernet switch. The switch issues a DHCP request and loads its configuration from the management node.

See the Bull HPC BAS5 for Xeon Administrator's Guide for information about how to perform changes for the management of the ClusterDB.

1.3 Stopping/Restarting a Backbone Switch

The backbone switches enable communication between the cluster and the external world. They are not listed in the ClusterDB. It is not possible to use ACT for their reconfiguration.

1.4 Stopping/Restarting the HPC Cluster

1.4.1 Stopping the HPC Cluster

To stop the whole cluster in complete safety, it is necessary to launch different stages in sequence. The nsclusterstop script includes al
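Because the MAC address is typed in by hand in the procedure above, it can be worth checking its shape before saving it. The sketch below assumes the colon-separated form `xx:xx:xx:xx:xx:xx`; the format actually expected by the admin_macaddr field is not stated in this guide, and `is_valid_mac` is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if $1 looks like a colon-separated
# 48-bit MAC address (six pairs of hexadecimal digits).
# Assumption: the ClusterDB field uses this colon-separated form.
is_valid_mac() {
    printf '%s\n' "$1" | grep -Eq '^([0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}$'
}

# Example:
#   is_valid_mac "00:1a:2b:3c:4d:5e" && echo "looks valid"
```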
102. n square brackets.

b. The consequences of the problem for the node. Three states are possible:
    not touched   The node was excluded by the deployment, with no impact for the node.
    restored      The configuration of the node was modified, but its initial configuration was able to be restored.
    corrupt       The node was corrupted by the operation.

c. The circumstance which led to the deployment problem.

Example
    node2 not touched node is configured in

Most of the time, the information in the excluded node list allows the source of the problem to be identified without the need for further analysis.

Possible Deployment Problems

There are 2 areas where deployment problems may occur.

Pre-check problems

Before the image is deployed, node states are verified in the ClusterDB database and through the use of nsm commands. If there are any problems, the nodes in question will be excluded from the deployment. The error will be displayed once the deployment has finished, and will also be logged in the /tmp/ksisServer/ksis_exclude_nodes_list file.

3.3.2.2 Image transfer problems

Problems may occur during the phase when the image is being transferred onto the target nodes. These problems are logged and centralised by Ksis on the Management Node. The errors will be displayed once the deployment has finished, and will also be logged in the /tmp/ksisServer/ksis_exclude_nodes_list file.

ksis image server lo
103. ndor daemon cannot talk to lmgrd
This means a pre-version 3.0 lmgrd is being used with a 3.0 vendor daemon. Simply use the latest version of lmgrd (it MUST be a version equal to or greater than the vendor daemon version). This can also happen if TCP networking does not function on the node where you are trying to run lmgrd (rare).

No licenses to serve
The license file has only uncounted licenses, and these do not require a server. Uncounted licenses have a 0 or "uncounted" in the number of licenses field on the FEATURE line.

Other
Starting lmgrd.intel from a remote directory may lead to unknown results. If lmgrd.intel is started from a remote directory, the license file line VENDOR INTEL should be modified to include the root directory where the INTEL vendor daemon resides:

    VENDOR INTEL <root directory path>

The lmgrd.intel daemon MUST be started with the -c argument:

    cd <installation directory>
    `pwd`/lmgrd.intel -c `pwd`/server.lic -l `pwd`/lmgrd.intel.log

Application Execution Problems

Cannot connect to license server
Usually this means the server is not running. It can also mean the server is using a different copy of the license file, which has a different port number from the one in the license file you are currently using. You can use the lmdiag utility to analyze this error more fully.

License Server does not support this Feature
This means
104. nly mode, or on NFS mounted disk or tape. The backups are protected and are inaccessible for non-authorized users.

The mkCDrec tool can be used for the following functions:

• To restore software. After booting from the mkCDrec CD-ROM or DVD-ROM, the /etc/recovery/start-restore.sh script will do the following:
    - Restore the complete system after a problem of some kind, for example a disk crash or a system intrusion.
    - Restore a particular disk using the backup source.
    - Restore a backup of a disk onto a new, bigger disk in the system.
• To make multiple backup copies.
• As a rescue tool, for example to do fsck operations or to diagnose what is wrong with the system. See the mkCDrec utilities in order to add more tools to your rescue CD-ROM or DVD-ROM.
• To clone a disk to another disk, even when the target disk is smaller in size than the original disk, as long as there is room for the data. The clone-dsk.sh script will calculate the partition layout for you.
• It is possible to make multi-volume CD-ROMs, so backups can be split up. It is also possible to back up all the data required for booting onto a CD-ROM, in order to obtain a bootable CD-ROM, and to save other data onto tape.
• To restore a single file system to an existing partition using the restore-fs.sh command. The user can select the target file system type, which has to be formatted. The command has no arguments.
• To set up or migrate to LVM, Software RAID or another type of f
105. nning jobs and other state information will be lost when using this option.

3.7.3 Jobs are not getting scheduled

1. This is dependent upon the scheduler used by SLURM. Run the following command to identify the scheduler:

    scontrol show config | grep SchedulerType

   See the Bull HPC Administrator's Guide for a description of the different scheduler types.

2. For any scheduler, the priorities of jobs can be checked using the following command:

    scontrol show job

3.7.4 Nodes are getting set to a DOWN state

1. Check to determine why the node is down using the following command:

    scontrol show node <name>

   This will show the reason why the node was set as down and the time when this happened. If there is insufficient disk space, memory space, etc. compared to the parameters specified in the slurm.conf file, then either fix the node or change slurm.conf. For example, if the temporary disk space specification is TmpDisk=4096, but the available temporary disk space falls below 4 GB on the system, SLURM marks it as down.

2. If the reason is "Not responding", then check the communication between the Management Node and the DOWN node by using the following command:

    ping <address>

   Check that the <address> specified matches the NodeAddr values in the slurm.conf file. If ping fails, then fix the network or the address in the slurm.conf file.

3. Login to the node that SLUR
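The TmpDisk example above amounts to comparing the free temporary space on the node, in MB, with the value declared in slurm.conf. A minimal sketch of that comparison follows; `free_mb` and `tmpdisk_ok` are hypothetical helpers, and the directory to check (here `/tmp`) is an assumption that should match the temporary file system configured for SLURM on the node.

```shell
#!/bin/sh
# Hypothetical helper: free space, in MB, on the file system holding $1.
# "df -Pk" uses the POSIX output layout; column 4 is the available space in KB.
free_mb() {
    df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024) }'
}

# Hypothetical helper: succeed if the free space ($1, MB) meets the
# TmpDisk requirement ($2, MB).
tmpdisk_ok() {
    [ "$1" -ge "$2" ]
}

# Example, run on the suspect node (4096 is the TmpDisk value from the text):
#   tmpdisk_ok "$(free_mb /tmp)" 4096 || echo "temporary space below TmpDisk"
```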
106. nsole window.
6. When the BIOS update has ended, the DOS prompt appears in the console window.
7. Select the Virtual Media button and discard the BIOS DOS image.
8. Reset the machine using the Remote Control button.

7.2 Updating the BIOS on NovaScale R440 or R460

The BIOS update on these platforms is done through the Bull Update BIOS CD, which allows the BIOS, BMC firmware and FRUs to be upgraded. Please follow the instructions provided with the CD.

7.3 BIOS Parameter Settings for NovaScale Rxxx Nodes

The BIOS parameter settings for the NovaScale R421, R421 E1, R422, R422 E1 Compute Nodes and R440, R460, R423 Service Nodes will normally be configured in the factory before the machines are delivered. However, if the cluster set-up is changed, the following settings can be used to reset the machines back to their original state.

Notes
• The settings shown in the tables are the default values. The parameter values that have to be changed for HPC are indicated in green and bold.
• Some of these settings, for example for the storage, will vary according to the cluster and will differ from the settings shown in the tables and screen grabs.

7.3.1 Examples

[Screen grab: example BIOS parameter setting screen (QuietBoot Mode, POST Errors, ACPI Mode)]
107. o the host, as in the example above, it probably means that the FTP server has not been installed on the host.

    ftp: connect: Connection refused
    ftp> quit

5.3 Upgrading the firmware

In the following example it is assumed that the end user has stored the firmware in the existing /path/to/firmware directory.

1. Extract the firmware archive to the /path/to/firmware directory as follows:

    cd /path/to/firmware
    tar xvf Ver_10.06_fw-1.0.0.tar
    voltaire_fw_images.tar
    voltaire_fw_ini.tar
    howto_upgrade_voltaire_switch.txt

2. Once the firmware has been extracted, log on to the switch and proceed with the upgrade.

   a. Upgrading the firmware for the whole switch:

    user@host ssh enable@switchname
    enable@switchname's password: voltaire
    Welcome to Voltaire Switch switchname
    Connecting
    switchname update firmware chassis <path_to_firmware>

   b. Upgrading the firmware for a specific line board (line board 4 in the example below):

    user@host ssh enable@switchname
    enable@switchname's password: voltaire
    Welcome to Voltaire Switch switchname
    Connecting
    switchname update firmware line 4 <path_to_firmware>

   c. Upgrading a fabric board (fabric board number 2 in the example below):

    user@host ssh enable@switchname
    enable@switchname's password: voltaire
    Welcome to Voltaire Switch switchname
    Connecting
    switchname update f
108. obs are not getting scheduled
3.7.4 Nodes are getting set to a DOWN state
3.7.5 Networking and Configuration Problems
3.7.6 More Information
3.8 FLEXlm License Manager Troubleshooting
3.8.1
3.8.2 Using the lmdiag utility
3.8.3 Using INTEL_LMD_DEBUG Environment Variables
Chapter 4. Accessing, Updating and Reconfiguring the BMC Firmware on NovaScale
4.1 The Baseboard Management Controller (BMC)
4.1.1 Local access to the BMC
4.1.2 Remote access to the BMC
4.2 Updating the BMC Firmware on NovaScale R421, R422, R422 E1 and R423 machines
4.3 Updating the BMC firmware on NovaScale R440 and R460 machines
4.4 Reconfiguring the BMC on R4xx machines
Chapter 5. Updating the firmware for the InfiniBand switches
5.1 Checking which Firmware Version is running
5.2 Configuring FTP for the firmware upgrade
5.2.1 Installing the FTP Server
5.2.2 Configuring the FTP server options for the InfiniBan
109. ons are up and running on all nodes:

    scontrol ping
    scontrol show node

3. Check the controller and/or slurmd log files (SlurmctldLog and SlurmdLog in the slurm.conf file) for an indication of why a particular node is failing.

4. Check for consistent slurm.conf and credential files on the node(s) experiencing problems.

5. If the problem is user-specific, check that the user is configured on the Management Node as well as on the Compute Nodes. The user does not need to be able to log in, but the user ID must exist. User authentication must be available on every node. If not, non-root users will be unable to run jobs.

6. Verify that the security mechanism is in place; see chapter 6 in the Bull HPC BAS5 for Xeon Administrator's Guide for more information on SLURM and security.

7. Check that a consistent version of SLURM exists on all of the nodes by running one of the following commands:

    sinfo -V

   or

    rpm -qa | grep slurm

   If the first two digits of the version number match, it should work fine. However, version 1.1 commands will not work with version 1.2 daemons, or vice versa. Errors can result unless all these conditions are true.

8. Each node must be synchronized to the correct time. Communication errors occur if the node clocks differ. Execute the following command to confirm that all nodes display the same time:

    pdsh -a date

   To check a group of nodes us
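The version rule in step 7 (the first two numbers of the version must match) can be expressed as a small comparison. This is a sketch only: `slurm_major_minor` and `slurm_versions_compatible` are hypothetical helpers, and the version strings are assumed to be plain dotted numbers such as 1.2.3.

```shell
#!/bin/sh
# Hypothetical helper: keep only the first two dot-separated fields of a
# version string, e.g. 1.2.10 -> 1.2
slurm_major_minor() {
    printf '%s\n' "$1" | cut -d. -f1,2
}

# Hypothetical helper: succeed if two SLURM versions share the same
# first two numbers, which is the compatibility rule stated above.
slurm_versions_compatible() {
    [ "$(slurm_major_minor "$1")" = "$(slurm_major_minor "$2")" ]
}

# Example:
#   slurm_versions_compatible 1.2.3 1.2.10 && echo "compatible"
```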
110. ot all the HBAs on a server.

Syntax
    lpflash <-m LP_Model -f path_to_firmware [-v]> <-h> <-V>

Flags
    -m model   Emulex HBA model to flash (case insensitive)
    -f file    firmware file
    -v         verbose mode
    -h         displays help
    -V         displays version

Example
    lpflash -m lp11000 -f /tmp/bd210a7.all

This command will upgrade all LP11000 HBAs to the 2.10A7 firmware.

2.2.6.3 Upgrade Emulex Firmware on Multiple Nodes

By running the pdcp and pdsh commands, the Emulex firmware can be upgraded in one shot on a set of nodes:
• use pdcp to copy the new firmware file onto all the nodes
• use pdsh to run lpflash on these nodes

Example
The following commands copy the Emulex firmware file onto nodes node1, node2 and node3, and then upgrade all the Emulex LP11000 HBAs on these nodes with firmware 2.10A7:

    pdcp -w node1,node2,node3 bd210a7.all /tmp
    pdsh -w node1,node2,node3 lpflash -m lp11000 -f /tmp/bd210a7.all

2.3 Saving and Restoring the System (mkCDrec)

To save and restore the Management Node system, use mkCDrec (make CD-ROM recovery). mkCDrec is an Open Source tool used to create a bootable system image which includes a Linux system save. The image is used to restore the system after a problem, such as a disk crash or system intrusion, has occurred. The backups are generally on CD-ROM or DVD-ROM, or on an offline disk, preferably in read o
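For a longer node list it can help to print the two commands first and review them before running anything. The sketch below only prints the pdcp and pdsh command lines used in the example above; `plan_emulex_upgrade` is a hypothetical helper, and `/tmp` as the remote directory is taken from that example.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) the commands that copy a firmware
# file to a comma-separated node list and flash it with lpflash.
#   $1 = node list, e.g. node1,node2,node3
#   $2 = firmware file on the local machine
#   $3 = Emulex HBA model, e.g. lp11000
plan_emulex_upgrade() {
    nodes=$1
    fw=$2
    model=$3
    remote="/tmp/$(basename "$fw")"
    echo "pdcp -w $nodes $fw /tmp"
    echo "pdsh -w $nodes lpflash -m $model -f $remote"
}

# Example:
#   plan_emulex_upgrade node1,node2,node3 bd210a7.all lp11000
```

Printing the plan rather than executing it keeps the sketch free of side effects; the printed lines can be pasted into the shell once checked.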
111. owing ones:

    file(<filename>)                        To send to a file.
    tcp(<ip>, <port>), udp(<ip>, <port>)    To send the logs on the network to another machine.
    unix-stream(<filename>)                 To send to stream pipes used in Linux.
    usertty(<user>)                         To send to the <user> consoles, but only if this user is connected. You can use the * character to specify that the messages have to be sent to all users.
    program(<commandtorun>)                 To send towards a program.

Examples

You can specify several destination directives in a destination section, as in the following example:

    destination debug { file("/var/log/debug.log"); };
    destination messages { file("/var/log/messages.log"); };
    destination console { usertty("root"); };
    destination xconsole { pipe("/dev/xconsole"); };
    destination mail2admin { program("/usr/bin/MailToAdmin"); };
    destination full { file("/dev/tty12"); file("/var/log/full.log" log_fifo_size(2000)); };

Note: You can add specific options, such as log_fifo_size(2000), as shown in the example above.

In the following example, all the logs will be sent to the Management Node, whose address is 192.168.0.100:

    destination central_log { tcp("192.168.0.100" port(514)); };

Using Macros

It may be useful to use macros to set intelligible names for your destination files. Predefined macros exist, such as $FACILITY, $PRIORITY or $LEVEL, $DATE, $FULLDATE
113. played:

    Enter your selection:
    1) Create rescue CD-ROM only (no backups)
    2) Create ISO backup images in /tmp to burn on CD-ROM or DVD
    3) Create backup on disk mounted (hard disk, NFS mount point, SMB mount point)
    4) Create backup on tape device (/dev/nst0)
    5) Quit
    Please choose from the above list [1-5]:

Select one of the displayed options (1 to 5). Follow the instructions displayed on the screen. When the operation is finished, ISO images ready for burning will be created in the directory specified in the configuration file (CDREC_ISO_DIR parameter).

Note: The mkcdrec log file can be checked if there is a problem.

Before burning a CD/DVD, you can check the contents of the ISO image using the following command:

    mount -o loop /backup/ISO/Cdrec.iso /mnt

2.3.3 Restoring a System

To restore a system, boot on the first CD-ROM/DVD-ROM, then run the command:

    /etc/recovery/start-restore.sh

Follow the instructions displayed on the screen. When the restore is completed, enter the reboot command. A new EFI boot entry is created.

2.4 Monitoring Maintenance Tools

2.4.1 Checking the status of InfiniBand Networks (ibstatus, ibstat)

2.4.1.1 ibstatus Command

ibstatus displays basic information obtained from each InfiniBand driver for the local adapter included in an InfiniBand network. Normal output includes LID, Subnet Manager LID, port state
114. ple of use of this command, including the Local ID and the port number, is below:

    smpquery portinfo 45 1

The resulting information output will be similar to that displayed below:

    Mkey:............................0x0000000000000000
    GidPrefix:.......................0xfe80000000000000
    Lid:.............................0x002d
    SMLid:...........................0x0003
    CapMask:.........................0x500a68
                                     IsTrapSupported
                                     IsAutomaticMigrationSupported
                                     IsSLMappingSupported
                                     IsLedInfoSupported
                                     IsSystemImageGUIDsupported
                                     IsVendorClassSupported
                                     IsCapabilityMaskNoticeSupported
    DiagCode:........................0x0000
    MkeyLeasePeriod:.................0
    LocalPort:.......................2
    LinkWidthEnabled:................1X or 4X
    LinkWidthSupported:..............1X or 4X
    LinkWidthActive:.................4X
    LinkSpeedSupported:..............2.5 Gbps
    LinkState:.......................Active
    PhysLinkState:...................LinkUp
    LinkDownDefState:................Polling
    ProtectBits:.....................0
    LMC:.............................0
    LinkSpeedActive:.................2.5 Gbps
    LinkSpeedEnabled:................2.5 Gbps
    NeighborMTU:.....................2048
    SMSL:............................0
    VLCap:...........................VL0-7
    InitType:........................0x00
    VLHighLimit:.....................0
    VLArbLowCap:.....................8
    InitReply:.......................0x00
    MtuCap:..........................2048
    VLStallCount:....................7
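When only one or two values from this output matter (typically the active width and speed), they can be pulled out with sed. A minimal sketch, assuming each field is printed as `Name:` followed by a run of dots and then the value; `portinfo_field` is a hypothetical helper, not part of smpquery.

```shell
#!/bin/sh
# Hypothetical helper: read "smpquery portinfo" text on stdin and print the
# value of the field named in $1.
# Assumption: lines have the form "Name:.....value" at the start of the line.
portinfo_field() {
    sed -n "s/^$1:\.*//p"
}

# Example:
#   smpquery portinfo 45 1 | portinfo_field LinkWidthActive
```

Only the dots directly after the colon are stripped, so values that contain a dot themselves, such as `2.5 Gbps`, are returned whole. Indented output would need the leading whitespace removed first.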
115. provides a simple command line interface to the BMC (Baseboard Management Controller).

To use the SOL (Serial Over LAN) interface, run the following command:

    ipmitool -I lanplus -C 0 -U <BMC_user_name> -P <BMC_password> -H <BMC_IP_Address> sol activate

BMC_user_name, BMC_password and BMC_IP_Address are values defined during the configuration of the BMC and are taken from those in the ClusterDB. The standard values for user name/password are administrator/administrator.

ipmitool Command Useful Options

• To start a remote SOL session (to access the console):
    ipmitool -I lanplus -C 0 -H <ip_addr> sol activate
• To reset the BMC and return to the BMC shell prompt:
    ipmitool -I lanplus -C 0 -H <ip_addr> bmc reset cold
• To display the FRU of the machine:
    ipmitool -H <ip_addr> fru print
• To display the network configuration:
    ipmitool -I lan -H <ip_addr> lan print 1
• To trigger a dump (signal INIT):
    ipmitool -H <ip_addr> power diag
• To power down the machine:
    ipmitool -H <ip_addr> power off
• To perform a hard reset:
    ipmitool -H <ip_addr> power reset
• To display the events recorded in the System Event Log (SEL):
    ipmitool -H <ip_addr> sel list
• To display the MAC address of the BMC:
    ipmitool -I lan -H <ip_addr> raw 0x06 0x52 0x0f 0xa0 0x06 0x08 0xef

Note: If -H is not specified, the command will address the BMC of the local mach
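Since most of these commands share the same connection options, a small wrapper that assembles the command line avoids retyping them. This is a sketch under stated assumptions: `bmc_cmd` and the `BMC_USER` variable are hypothetical, the user name defaults to the standard value quoted above, and `-E` (read the password from the IPMI_PASSWORD environment variable) is used in place of `-P` so that the password does not appear on the command line. The function only prints the command.

```shell
#!/bin/sh
# Hypothetical helper: print the ipmitool command line for a BMC.
#   $1   = BMC IP address
#   rest = ipmitool subcommand, e.g. "sol activate" or "power reset"
# Assumptions: lanplus interface with cipher suite 0 as in the examples above;
# BMC_USER overrides the default user name; the password is expected in the
# IPMI_PASSWORD environment variable (ipmitool -E).
bmc_cmd() {
    host=$1
    shift
    echo "ipmitool -I lanplus -C 0 -U ${BMC_USER:-administrator} -E -H $host $*"
}

# Example (10.0.0.5 is a placeholder address):
#   bmc_cmd 10.0.0.5 sol activate
```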
116. 28 lids found

    ISR9024D M Voltaire    lid 0x1  port 0   guid 0008f10400411e54  state Active  width 4X  rate 2.5
    ISR9024D Voltaire      lid 0x2  port 15  guid 0008f10400411d6a  state Active  width 4X  rate 5.0
    ISR9024D M Voltaire    lid 0x1  port 0   guid 0008f10400411e54  state Active  width 4X  rate 2.5
    ISR9024D Voltaire      lid 0x11 port 13  guid 0008f10400411da2  state Active  width 4X  rate 5.0
    ISR9024D Voltaire      lid 0x11 port 18  guid 0008f10400411da2  state Active  width 4X  rate 5.0
    ISR9024D Voltaire      lid 0x3  port 6   guid 0008f10400411d70  state Active  width 4X  rate 5.0
    ISR9024D M Voltaire    lid 0x1  port 0   guid 0008f10400411e54  state Active  width 4X  rate 2.5
    ISR9024D Voltaire      lid 0x2  port 15  guid 0008f10400411d6a  state Active  width 4X  rate 5.0
    ISR9024D Voltaire      lid 0x2  port 4   guid 0008f10400411d6a  state Active  width 4X  rate 5.0
    bali6 HCA 1 RACK1 D    lid 0x4  port 1   guid 0002c90200234405  state Active  width 4X  rate 5.0
    ISR9024D M Voltaire    lid 0x1  port 0   guid 0008f10400411e54  state Active  width 4X  rate 2.5
    ISR9024D Voltaire      lid 0x2  port 16  guid 0008f10400411d6a  state Active  width 4X  rate 5.0
    ISR9024D Voltaire      lid 0x2  port 5   guid 0008f10400411d6a  state Active  width 4X  rate 5.0
    bali HCA 1 RACK1 E1    lid 0x5  port 1   guid 0002c9020023440d  state Active  width 4X  rate 5.0
    ISR9024D M Voltaire    lid 0x1  port 0   guid 0008f10400411e54  state Active  width 4X  rate 2.5
    ISR9024D Voltaire      lid 0x2  port 3   guid 0008f10400411d6a  state Act
117. r about specified console(s).
    -Q   Be quiet and suppress informational messages.
    -r   Match console names via regex instead of globbing.
    -v   Be verbose.
    -V   Display version information.

Once a connection is established, enter "&." to close the session, or "&?" to display a list of currently available escape sequences. See the conman man page for more information.

Examples
• To connect to the serial port of NovaScale bull47, run the command:

    conman bull47

Configuration File

The /etc/conman.conf file is the conman configuration file. It lists the consoles managed by conman and the configuration parameters. The /etc/conman.conf file is automatically generated from the ClusterDB information. To change some parameters, the administrator should only modify the /etc/conman-tpl.conf template file, which is used by the system to generate /etc/conman.conf. It is also possible to use the dbmConfig command. See the Cluster Data Base Management chapter for more details. See the conman.conf man page for more information.

Note: The timestamp parameter, which specifies the watchdog frequency, is set to 1 minute by default. This value is suitable for debugging and tracking purposes, but generates a lot of messages in the /var/log/conman file. To disable this function, comment out the line SERVER timestamp=1m in the /etc/conman-tpl.cfg file.

2.2.1.2 Using ipmi Tools

The ipmitool command
118. re details are available:
• firmware levels
• serial number
• WWNN and WWPN for fibre channel HBAs

Example

    lsiocfg -cv

    HOST CHANNEL INVENTORY
    Host Driver Unique_id Cmd/Lun HostQ State Model
    host0 mptbase 0 7
    host1 mptbase 1 7
    host2 lpfc 0 30 LINK_UP LP11000
        DRV 8.0.30_p1 FW 2.10A7 (B2D2.10A7) Bus Number 26 SN VM53824841
        Host WWNN 20:00:00:00:c9:4b:e7:02 Host WWPN 10:00:00:00:c9:4b:e7:02
        FN 20:00:00:00:c9:4b:e7:02 speed 2 Gbit
    host3 usb-storage 0 1

2.4.4.2 Disks Inventory

Using the lsiocfg Disk inventory option, you can get basic information about the available disks:
• system location
• vendor
• state
• disk size

When getting the disk inventory in verbose mode, more details are shown:
• model
• serial number
• firmware revision
• WWPN (fibre channel devices)

    lsiocfg -dv

    DISK INVENTORY
    Dev Location Maj:Min Vendor state Size(MB) QueueDepth Lname
    (location = Host:Channel:Id:LUN)
    sdb 0:0:10:0 8:16 SEAGATE running 286102 31
        MODEL SEAGATE ST3300007LC FWREV 0003 SERIAL 3KROKTPH00007547TROP TRANSPORT SPI
    sdc 0:0:11:0 8:32 SEAGATE running 286102 31
        MODEL SEAGATE ST3300007LC FWREV 0003 SERIAL 3KROKTHM000075475NWC TRANSPORT SPI
    sda 0:0:9:0 8:0 SEAGATE running 286102 31
        MODEL SEAGATE ST3300007LC FWREV 0003 SERIAL
119. rs per Million packets sent.

See FAQ ID F10040, "How to debug and clear InfiniBand fabric errors using FVM PM Counters CSV file", available from www.voltaire.com, for details of the different Port Counter error messages.
120. rt {0008f104004118e2}[8] lid 0x4-0x4 "ISR9024D Voltaire"
    [13] -> switch port {0008f104004118e8}[16] lid 0x3-0x3 "ISR9024D M Voltaire"
    [21] -> switch port {0008f104004118e4}[13] lid 0x1-0x1 "ISR9024D Voltaire"
    [4] -> ca port {0008f10403979985}[1] lid 0x2c-0x2c "lynx19 HCA-1"
    To ca {0008f10403979984} portnum 1 lid 0x2c-0x2c "lynx19 HCA-1"

In short:
    -> OUT lynx13 (lid 0x22) port 1
    -> INTO node switch (lid 0x4) port 8
    -> OUT node switch (lid 0x4) port 13
    -> INTO top switch (lid 0x3) port 16
    -> OUT top switch (lid 0x3) port 21
    -> INTO node switch (lid 0x1) port 13
    -> OUT node switch (lid 0x1) port 4
    -> INTO lynx19 (lid 0x2c) port 1

2.5.3 Using dump tools with RHEL5 (crash, proc, kdump)

Various tools allow problems to be analysed whilst the system is in operation:

• crash portrays system data symbolically, using the possibilities provided by the GDB debugger. The commands which it offers are system oriented, for example the list of tasks, tracing function calls for a task which is waiting, etc. See the crash man page for more information.
• The /proc file system may be used to view and, if necessary, modify system information. In particular, it can be used to examine system information for different tasks, the state of the memory allocation, etc. See the proc man page for more information.
• In the event of a system crash, memory
121. rt nsclusterstop --help
    --only_test, -o   Display the commands that would be launched according to the specified options. This is a testing mode; no action is performed.
    --verbose, -v     Verbose mode.

Configuration files

/etc/clustmngt/nsclusterstart.conf

    # First Part is used to control the power supply of DDN and servers
    # time to wait for all diskarrays ok before powering the powerswitches on
    disk_arrays_StartDelay 300
    # time to wait for all powerswitches being ON after a poweron
    couplets_StartDelay 60
    # time to wait after poweron for all servers being effectively operational
    servers_StartDelay 480
    # Following part is used to control the order to start nodes groups
    # GROUP <nb simultaneous poweron> <time to wait> <period to wait> <time to wait after this GROUP>

/etc/clustmngt/nsclusterstop.conf

    # First Part is used to control the power supply of DDN and servers
    # time to wait after poweroff for all servers being effectively down
    servers_StopDelay 180
    # time to wait for ddn processing shutdown
    ddnShutdown_Time 180
122. ry the IB subnet manager to obtain and update data using the error and traffic counters.

Data related options

By default, IBS analyses the data contained in the IBSDB database unless the -s or -l flags are used. This default mode is known as database mode.

    -s <switch>    Connected mode. Connect to the switch specified by its hostname or IP address, and then retrieve the NetworkMap.xml and PortCounters.csv files for this switch.
    -l             Local mode. Use the NetworkMap.xml and PortCounters.csv files that are available locally, or that are specified by the -f and -c flags, for the analysis. These files can then be analysed separately on a machine which is not part of the cluster. However, as stated above, it is better to work within the OFED stack using the -N and -E options to obtain the latest data.
    -f filename    Specify the file to be used when loading or saving the network map file (NetworkMap.xml). When used in conjunction with the -s switch option, the file downloaded from the switch will be saved to the file <filename>. When used in conjunction with the -l flag, the specified file will be used as the input file.
    -c filename    Specify the file to be used when loading or saving the port counters file (PortCounters.csv). When used in conjunction with the -s switch option, the file downloaded from the switch will be saved to the file <filename>. When used in conjunction
123. s time needed to boot the system

    Power Button Behavior       Instant Off
    Resume On Modem Ring        Off
    Power Loss Control          Last State
    Watch Dog                   Disabled
    Summary screen              Enabled

    Figure 7-2. Example BIOS parameter setting screen for NovaScale R422

    7.3.2 NovaScale R421 BIOS Settings

    Mainboard: X7DBR-8, X7DBR-i
    R421 BIOS 1.3

    BIOS setup section / parameter      value

    Main
        System Time
        System Date
        Legacy diskette A
        Serial ATA                      Enabled
        Native Mode Operation           Serial ATA
        SATA Controller Mode Option     Compatible
    Advanced
      Boot Features
        QuickBoot Mode                  Disabled
        QuietBoot Mode                  Disabled
        POST Errors                     Disabled
        ACPI Mode                       Yes
        Power Button Behaviour          Instan
124. s for full details on using these tools.

    CLI Diagnostic Tools

    3.1.5.1 zero-counters script

    To clear out all the errors across the fabric, use the zero-counters script to traverse the fabric and clear out all the port counters on both the switches and the HCAs. This script is very easy to use and is helpful if you want to start off with a clean baseline of your fabric after many changes have occurred.

        ISR9288(utilities)# zero-counters
        Zero All Counters
        lid 1 ports 24 ************************
        lid 5 ports 24 ************************
        lid 4 ports 24 ************************
        lid 3 ports 24 ************************
        lid 2 ports 24 ************************
        lid 11 ports 24 ************************

    Note: See the Voltaire Switch User Manual ISR 9024, ISR 9096, and ISR 9288/2012 Switches for full details on the CLI commands.

    3.1.5.2 width-check script

    Another valuable script is the width-check script, which allows you to easily check the fabric for 1X connections (links). While the fabric will work over a 1X connection, it will create a bottleneck and hurt performance within the fabric. All links should report no 1X connections when the script is run. Nothing else will be reported other than the LID and GUID if it is a full 4X link.

        ISR9288(utilities)# width-check
        Verify every error found wil
125. s Management Packets (GMP) to obtain the PortCounters (basic performance and error counters) from the Performance Management Attributes at the node specified.

    The command syntax is shown below:

        perfquery [options] [<lid|guid> [[port] [reset_mask]]]

    Non standard flags:

        -a    Show aggregated counters for all ports of the destination lid.
        -r    Reset counters after read.
        -R    Only reset counters.

    Examples

    • To read the local port's performance counters, enter:
        perfquery
    • To read performance counters from lid 32, port 1, enter:
        perfquery 32 1
    • To read node aggregated performance counters, enter:
        perfquery -a 32
    • To read performance counters and reset, enter:
        perfquery -r 32 1
    • To reset performance counters of port 1 only, enter:
        perfquery -R 32 1
    • To reset performance counters of all ports, enter:
        perfquery -R -a 32
    • To reset only non-error counters of port 2, enter:
        perfquery -R 32 2 0xf000

    Example output

    The resulting information output will be similar to that displayed below:

        # Port counters: Lid 45 port 2
        PortSelect:......................2
        CounterSelect:...................0x0000
        SymbolErrors:....................0
        LinkRecovers:....................0
        LinkDowned:......................0
        RcvErrors:.......................0
        RecvRemotePhysErrors:............0
        RecvSwRelayErrors:...............0
        XmtDiscards:.....................2
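When many ports have to be checked, counter listings like the example output above are easier to scan programmatically. The Python sketch below is a minimal illustration only: it assumes the output has already been captured as text, assumes the usual `Name:....value` line layout, and uses a hypothetical helper name (`parse_counters`). The sample text is a shortened copy of the example, not real measurement data.

```python
import re

# Shortened, illustrative copy of the example output; not captured from a live fabric.
SAMPLE_OUTPUT = """\
# Port counters: Lid 45 port 2
PortSelect:......................2
CounterSelect:...................0x0000
SymbolErrors:....................0
LinkRecovers:....................0
LinkDowned:......................0
XmtDiscards:.....................2
"""

# Fields that identify the port rather than count events.
SELECTOR_FIELDS = {"PortSelect", "CounterSelect"}


def parse_counters(text):
    """Map each 'Name:....value' line to an integer (hex values are accepted)."""
    counters = {}
    for raw in text.splitlines():
        match = re.match(r"^(\w+):\.*\s*(\S+)$", raw.strip())
        if match:  # the '# Port counters' header does not match and is skipped
            counters[match.group(1)] = int(match.group(2), 0)
    return counters


counters = parse_counters(SAMPLE_OUTPUT)
nonzero = {name: value for name, value in counters.items()
           if name not in SELECTOR_FIELDS and value != 0}
print(nonzero)
```

For this sample only XmtDiscards is reported as non-zero, which is the kind of counter an administrator would then investigate or reset.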
126. s should be checked against the maximum which is possible. For example, if the port supports 4x bandwidth then this should be used. Similarly, if the adapter supports DDR then this should be used.

    Syntax:
        ibcheckportwidth [-h] [-v] [-G] <lid|guid> <port>

    Example:
        ibcheckportwidth -v 0x2 1

    Output:
        Port check lid 0x2 port 1: OK

    3.2.5 More Information

    Please refer to the man pages for more information on all the tools described in this section, and also on the other OpenIB tools which are available.

    3.3 Node Deployment Troubleshooting

    ksis is the deployment tool used to deploy node images on Bull HPC systems. This section describes how deployment problems are logged by ksis for different parts of the deployment procedure.

    3.3.1 ksis deployment accounting

    Following each deployment, ksis takes stock of the nodes and identifies those that have had the image successfully deployed onto them and those that have not. This information is listed in the files below and remains available until the next image deployment.

    • List of nodes successfully deployed to: /tmp/ksisServer/ksis_nodes_list
    • List of nodes not deployed to: /tmp/ksisServer/ksis_exclude_nodes_list

    When the image has failed to be deployed to a particular node, ksis adds a line in the ksis_exclude_nodes_list file to indicate:

    a. The name of the node betwee
127. Figure 2-1. Example of IBS command topo action output

    Use the command below to obtain the fabric topology using the data stored in the IBS database. The hostnames and traffic counters are updated using the OFED tools.

        ibs -a topo -NE

    Use the command below to dump the fabric topology using the local map file test/NetworkMap.xml and test/portcounters.csv. The data read from these files is updated using the OFED tools.

        ibs -l -f test/NetworkMap.xml -c test/portcounters.csv -a topo -NE

    bandwidth

    The syntax for the bandwidth action is shown below. This action is very useful when benchmarking, in order to monitor the performance of a switch and to identify any bottlenecks.

        ibs -s <switch_name> -a bandwidth -NE

    Details of packets sent and received for the switch, for both local and remote connections, are displayed as shown in Figure 2-2.

    errors

    The errors action can be used to produce a short report containing details of the faulty links for a switch. This is very useful for troubleshooting and will help to pinpoint any problems for the interconnects.

        ibs -s <switch_name> -a errors -NE

    This will give output similar to that shown in Figure 2-3. EPM indicates the error rate in the form of Erro
128. sername -c umask 022 lmgrd -c

    It is not recommended to run lmgrd as root; the su username is used to run lmgrd as a non-privileged user.

    2. Add sleep 2 after the lmgrd command.

    Chapter 4. Accessing, Updating and Reconfiguring the BMC Firmware on NovaScale R4xx machines

    This chapter describes how to update the BMC firmware on NovaScale R421, R422, R422 E1, R423, R440 and R460 machines.

    4.1 The Baseboard Management Controller (BMC)

    The Baseboard Management Controller (BMC) is used to monitor the hardware sensors for temperature, cooling fan speeds, power mode, etc., and to report any hardware errors by sending alerts. It is also used for basic system management operations such as starting, stopping and resetting a cluster. It also provides a remote console on the cluster nodes via Serial over LAN (SOL) access.

    The BMC is the intelligence in the Intelligent Platform Management Interface (IPMI) architecture. The BMC manages the interface between system management software and platform hardware.

    There are several ways to access the BMC of a machine.

    4.1.1 Local access to the BMC

    The BMC of the local machine can be accessed using the ipmitool command.

    See Chapter 2 in this manual or the man page for more information.

    The IPMI service must be started to access the local BMC via the IPMI driver:

        service ipmi start

    Examples

    1. To obtain the BMC LAN con
129. sive ISR9024D Voltaire
    Port 4 direct path from self switch 0 1 4

    2.4.3.3 Verifying the ports

    The whole InfiniBand fabric can be checked using the port-verify command as follows:

        switchname(utilities)# port-verify
        Topology file: generated on Thu Oct 4 20:19:24 2007
        devid 0x5a31
        switchguids 0x8f1040041254a
        Switch 24 S-0008f1040041254a ISR9024D-M Voltaire smalid 8
        1 S-0008f10400411946 13 width 4X speed 5.0 Gbs
        2 S-0008f10400411946 14 width 4X speed 5.0 Gbs
        3 S-0008f10400411946 15 width 4X speed 5.0 Gbs
        ...
        devid 0x6282
        hcaguids 0x2c9020024b940
        Hca 2 H-0002c9020024b940 zeus8 HCA-1
        1 S-0008f1040041281le 1 lid 72 lmc 3 width 4X speed 5.0 Gbs
        SUMMARY: NO PROBLEMS DETECTED

    2.4.3.4 Checking the port width

    To ensure the best performance, check that the ports are running in 4x mode as follows:

        switchname(utilities)# width-check
        Verify (every error found will be printed)
        lid 8 guid 0008f1040041254a ports 24
        lid 160 guid 0008f1040041281le ports 24
        lid 152 guid 0008f10400411946 ports 24

    2.4.3.5 Dealing with a faulty port

    When a faulty port is diagnosed, it can be disabled or reset using the port-manage command as below:

        iswu0c0-0(utilities)# port-manage
        Description: port-manage.sh is used to trigger
130. t 3B Option ROM                     Enabled
        PCI Slot 3C Option ROM          Enabled
      Peripheral Configuration
        Serial port A                   Enabled
        Base I/O address                3F8
        Interrupt                       IRQ 4
        Serial port B                   Enabled
        Base I/O address                2F8
        Interrupt                       IRQ 3
        USB 2.0 Controller              Enabled
        Parallel ATA                    Enabled
        Serial ATA                      Enabled
        SATA Controller Mode Option     Compatible
      Advanced Chipset Control
        Multimedia Timer
        Intel(R) I/OAT
        Wake On LAN/PME                 Enabled
        Wake On Ring                    Disabled
        Wake On RTC Alarm               Disabled
      Boottime Diagnostic Screen
      Reset Configuration Data          No
      NumLock                           On
      Memory/Processor Error            Boot
    Security
        Supervisor Password Is          Clear
        User Password Is                Clear
        Password on boot                Disabled
        Fixed disk boot sector          Normal
        Power Switch Inhibit            Disabled
    Server
      Console Redirection
        BIOS Redirection Port
        ACPI Redirection Port           Disabled
        Baud Rate
        Flow Control
        Terminal Type                   VT100
        Remote Console Reset
      Assert NMI on PERR                Enabled
      Assert NMI on SERR                Enabled
      FRB-2 Policy                      Retry 3 Times
      Boot Monitoring                   Disabled
      Boot Monitoring Policy            Retry 3 Times
      Thermal Sensor                    Enabled
      BMC IRQ                           IRQ 11
      Post Error Pause                  Enabled
      AC-LINK                           Last State
      Power On Delay Time               20
      Platform Event Filtering          Enabled
    Boot

    Glossary and Acronyms

    A
        ACT     Administration Configuration Tool
    B
        BAS     Bull Advanced Server
        BIOS    Basic Input Outpu
131. t Off
        Resume On Modem Ring                Off
        Power Loss Control                  Last State
        Watch Dog                           Disabled
        Summary screen
      Memory Cache
        Cache System BIOS area              Write Protect
        Cache Video BIOS area               Write Protect
        Cache Base 0-512k                   Write Back
        Cache Base 512k-640k                Write Back
        Cache Extended Memory Area          Write Back
        Discrete MTRR Allocation            Disabled
      PCI Configuration
        Onboard G-LAN1 OPROM Configure, Onboard G-LAN2 OPROM Configure, Default Primary Video Adapter, Emulated IRQ Solution, PCI-e I/O Performance: Disabled, Onboard, Disabled
        PCI Parity Error Forwarding         Disabled
        ROM Scan Ordering                   Onboard First
        PCI Fast Delayed Transaction        Disabled
        Reset Configuration Data            No
        Frequency for PCIX 1 2 MASS         Auto
        SLOT PCI-X 100MHz
          Option ROM Scan                   Enabled
          Enable Master                     Enabled
          Latency Timer                     Default
        SLOT2 PCI-X 100MHz ZCR
          Option ROM Scan                   Enabled
          Enable Master                     Enabled
          Latency Timer                     Default
        SLOT2 PCI-X 100MHz ZCR
          Option ROM Scan                   Enabled
          Enable Master                     Enabled
          Latency Timer                     Default
        SLOT3 PCI-Exp x8
          Option ROM Scan                   Enabled
          Enable Master                     Enabled
          Latency Timer                     Default
        SLOT4 PCI-Exp x8
          Option ROM Scan                   Enabled
          Enable Master                     Enabled
          Latency Timer                     Default
        SLOT5 PCI-Exp x8
          Option ROM Scan                   Enabled
          Enable Master                     Enabled
          Latency Timer                     Default
        Large Disk Access Mode              DOS
      Advanced Chipset Control
        SERR signa
132. t System
        BMC     Baseboard Management Controller
    C
        CLI     Command Line Interface
    D
        DDN     Data Direct Networks
        DHCP    Dynamic Host Configuration Protocol
    E
        ECT     Embedded Configuration Tool
    F
        FDA     Fibre Disk Array
        FRU     Field Replaceable Unit
        FTP     File Transfer Protocol
    G
        GCC     GNU C Compiler
        GNU     GNU's Not Unix
        GPL     General Public License
        GUI     Graphical User Interface
        GUID    Globally Unique Identifier
    H
        HBA     Host Bus Adapter
        HPC     High Performance Computing
    I
        IPMI    Intelligent Platform Management Interface
    K
        KSIS    Utility for Image Building and Deployment
    L
        LAN     Local Area Network
        LDAP    Lightweight Directory Access Protocol
        LUN     Logical Unit Number
    M
        MAC     Media Access Control (address)
        MPI     Message Passing Interface
    N
        NFS     Network File System
        NIS     Network Information Service
        NS      NovaScale
        NTP     Network Time Protocol
    P
        PCI     Peripheral Component Interconnect (Intel)
    R
        RAID    Redundant Array of Independent Disks
    S
        SCSI    Small Computer System Interface
        SLURM   Simple Linux Utility for Resource Management
        SMP     Symmetric Multi Processing
        SMT     Symmetric Multi Threading
        SNMP    Simple Network Management Protocol
        SOL     Serial Over LAN
        SSH     Secure Shell
    T
        TCP     Transmission Control Protocol
        TFTP    Trivial File Transfer Protocol
    U
        UDP     User Datagra
133. taneous nsm actions; for example, with 5 you can run 5 simultaneous nsmpower processes. Default: 30.

    --only_test, -o
        Display the NS Commands that would be launched according to the specified options and action. This is a testing mode; no action is performed.
    --time
        Time to wait after the number of nsm calls defined by the --interval option.
    --verbose, -v
        Verbose mode.

    Specifying nodes

    The nodes are specified as follows: basename[i,j-k]. If no nodes are explicitly specified, nsctrl uses the nodes defined by the --pap or --group option.

    Actions

    poweron, poweroff, poweroff_force, reset, status, ping

    Examples

    Note: In the following examples the -o option (only_test) is used to display which NS Commands would be launched for the specified action.

    • To power off node ns1, enter:
        nsctrl -o poweroff_force ns1
        ns1 /usr/NSMasterHW/bin/nsmpower.sh -a off_force -m ipmilan -H ns1 -u user2
    • To ping node ns1, enter:
        nsctrl -o ping ns1
        ns1 ping -c 1 ns1

    2.2.4 Remote Hardware Management CLI (NS Commands)

    The Remote Hardware Management CLI (Command Line Interface) is a set of commands that perform hardware tasks on Bull HPC systems; these are also known as NS Commands. These commands provide the administrator with an easy way to automate scripts to power on and off, and to get hardware information about the nodes.

    Managing System Logs (syslog-ng)
134. terleave
        Branch 0 Rank Interleave            4:1
        Branch 0 Rank Sparing               Disabled
        Branch 1 Rank Interleave            4:1
        Branch 1 Rank Sparing               Disabled
        Enhanced x8 Detection               Enabled
        Demand Scrub                        Enabled
        High Temp DRAM OP                   Disabled
        AMB Thermal Sensor                  Disabled
        Thermal Throttle                    Disabled
        Global Activation Throttle          Disabled
        Force ITK Config Clocking           Disabled
        Snoop Filter                        Enabled
        Crystal Beach Feature               Enabled
        Route Port 80h cycles to            LPC
        Clock Spectrum Feature              Disabled
        High Precision Event Timer          No
        USB Function                        Enabled
        Legacy USB Support                  Enabled
      Advanced Processor Options
        Frequency Ratio                     Default
        Core Multi-Processing               Enabled
        Machine Checking                    Enabled
        Fast String operations              Enabled
        Thermal Management 2                Enabled
        C1/C2 Enhanced Mode                 Disabled
        Execute Disable Bit                 Enabled
        Adjacent Cache Line Prefetch        Enabled
        Hardware Prefetcher                 Enabled
        Set Max Ext CPUID = 3               Disabled
        Direct Cache Access                 Disabled
        Intel(R) Virtualization Technology  Disabled
        Intel EIST support                  Disabled
      I/O Device Configuration
        KBC Clock Input                     12MHz
        Serial port A                       Enabled
        Base I/O address (Serial port A)    3F8
        Interrupt (Serial port A)           IRQ 4
        Serial port B                       Enabled
        Mode                                Normal
        Base I/O address (Serial port B)    2F8
        Interrupt (Serial port B)           IRQ 3
        Parallel Port                       Disabled
        Floppy disk controller              Enabled
        Base I/O address                    Primary
      DMI Event Logging
        Event Logging                       Enabled
        ECC Event Logging                   Enabled
135. tus, ping, checking temperature, changing bios, etc.

        syslog-ng                   System log management
        lptools (lputil, lpflash)   Upgrading Emulex HBA firmware (Host Bus Adapter)
        mkCDrec                     Backing up and restoring data
        ibstatus, ibstat            Monitoring InfiniBand networks
        IBS tool                    Providing information about and configuring InfiniBand switches
        switchname                  Monitoring Voltaire switches
        lsiocfg                     Getting information about storage devices
        pingcheck                   Checking device power state
        ibdoctor, ibtracert         Identifying InfiniBand network problems
        crash, proc, kdump          Runtime debugging and dump tool
        postbootchecker             Making verifications on nodes as they start

    Table 2-1. Maintenance Tools

    2.2 Maintenance Administration Tools

    2.2.1 Managing Consoles through Serial Connections (conman, ipmitool)

    The serial lines of the servers are the communication channel to the firmware and enable access to the low-level features of the system. This is why they play an important role in system init, in surveillance, or in taking control if there is a crash or a debugging operation is undertaken.

    The serial lines are brought together with Ethernet/Serial port concentrators, so that they are available from the Management Node.

    • ConMan can be used as a console management tool. See 2.2.1.1 Using ConMan.
    • ipmitool allows you to use a Serial Over LAN (SOL) l
136. upgrading, 2-13

    F
        FDA troubleshooting, 3-16
        files
            conman.conf, 2-3
            mkcdrec/Config.sh, 2-16
            syslog-ng.conf, 2-9
        firmware update
            BMC, 4-1
            InfiniBand switch, 5-1
            MegaRAID card, 6-1
            Voltaire switch, 5-1
        FLEXlm License Manager troubleshooting, 3-28
        fsck command, 2-15
    H
        HA consistent state, 3-23
        HA Lustre troubleshooting, 3-19
        Hardware Management CLI, 2-8
    I
        ibchecknet command, 3-9
        ibcheckportwidth command, 3-10
        ibcheckwidth command, 3-10
        ibdoctor, 2-37
        ibnetdiscover command, 3-9
        IBS command, 2-20
            availability action, 2-29
            bandwidth action, 2-23
            config action, 2-26
            dbcreate action, 2-26
            dbdelete action, 2-27
            dbpopulate action, 2-27
            dbupdate action, 2-28
            dbupdatepc action, 2-29
            -E option, 2-20
            errors action, 2-23
            group action, 2-26
            group.csv file, 2-26
            -N option, 2-20
            topo action, 2-21
        IBS tool, 2-20
            NetworkMap.xml, 2-20
            portcounters.csv, 2-20
        IBSDB Database, 2-20, 2-26
        ibstat command, 2-18
        ibstatus command, 2-18
        ibtracert, 2-38
        InfiniBand status, 2-18
        InfiniBand switch firmware update, 5-1
        INTEL_LMD_DEBUG environment variable, 3-28
        ioshowall command, 3-21
        ipmitool, using, 2-4
    K
        Kernel problems, 2-41
    L
        lctl command, 3-22
        licenses, 3-28
        lmdiag command, 3-28
        lpflash command, 2-13
        lptools, 2-13
        lputil command, 2-13
        lsiocfg command, 2-33
        lsmod command, 2-13
        Lustre HA troubleshooting, 3-19
        Lustre failover service
137. will be written to the configured disk location using kdump. Upon subsequent reboot, the data will be copied from the old memory, formatted into a vmcore file and stored in the /var/crash/ subdirectory. The end result can then be analysed using the crash utility. An example command is shown below:

        crash /usr/lib/debug/lib/modules/<kernel_version>/vmlinux vmcore

    See Chapter 2 in the BAS5 for Xeon Installation and Configuration Guide for details on how to configure kdump.

    Important: It is essential to use non-stripped binary code within the kernel. Non-stripped binary code is included in the debuginfo RPM available from http://people.redhat.com/duffy/debuginfo/index-js.html
    This package installs the kernel binary in the folder /usr/lib/debug/lib/modules/<kernel_version>/

    2.5.4 Identifying problems in the different parts of a kernel

    Various configuration parameters enable traces or additional checks to be used on different kernel operations (for example locks, memory allocation and so on). It is usually possible to focus the debug mode on the problematic part of the kernel which has been identified, after recompilation.

    It is also possible to insert code (e.g. printk) to help examine the problematic part. The different compilation tasks for a machine (stopping, starting, resetting, creating a dump, bootstrapping a compiled system and debugging) may be carried out from
138. witch IDs from database clusterdb
        Creating IB hosts                       HCA 21, ASICs 0, ISR9024 3, ISR9096 0, ISR9288/2012 0, total 24
        Updating equipment localisation from database clusterdb
        Updating equipment IP addresses from database clusterdb
        No board found
        boards 0, chassis 0
        assigned 74, total 74
        assigned 37 pairs, total 37 pairs
        using /usr/local/ofed/bin/smpquery      updated 24, failed 0, total 24
        using /usr/local/ofed/bin/perfquery     updated 74, failed 0, total 74
        assigned 74, not assigned 0, total 74
        Connecting to database clusterdb on host localhost:5432    Done
        24 localisations updated
        24 IP addresses updated
        21 switch IDs updated
        Connecting to database ibsdb on host localhost:5432        Done
        Populating table chassis in database ibsdb                          0 chassis stored
        Populating tables asic and chassis in database ibsdb                3 ISR9024 switch stored
        Populating table board in database ibsdb                            0 boards stored
        Populating table asic in database ibsdb                             0 ASICs stored
        Populating table hca in database ibsdb                              21 HCAs stored
        Populating tables asic_port and hca_port in database ibsdb          74 ports stored
        Populating tables asic_portcounters and hca_portcounters            74 portcounters stored

    dbupdate

    Use the dbupdate action to update an existing IBSDB database. In the example below, the topology and traffic counter details for the iswu0c0-0 managed switch from the Management Node is updat
139. y

    • lsiocfg [-P] [-v] -c [HBAs IDs]
        Gives information about all SCSI controllers. If HBAs IDs are specified, only applies to this list of HBAs.
    • lsiocfg [-P] [-v] -d [-u] [devices names]
        Gives information about SCSI devices. -u has to be used to display non-disk devices. If devices are specified, only applies to this list of devices.
    • lsiocfg -p
        Displays partitions.
    • lsiocfg [-P] [-v] -a
        Displays all (-c, -d, -p).
    • lsiocfg -r user -n remote_node [-P] [-v] <-c|-d|-a>
        Gives information from a remote node about controllers and disks.
    • lsiocfg -M [devices names]
        Gives information about SCSI devices usage.
    • lsiocfg <-l|-L> <wwpn>
        Reports the WWPN owner. The -l flag uses the /etc/wwn file and the -L flag uses the cluster manager database.
    • lsiocfg <-w|-W>
        Displays all WWPN owners. The -w flag uses the /etc/wwn file and the -W flag uses the cluster manager database.

    General flags:

        -P    No headers before -a, -c, -d commands.
        -v    Verbose before -a, -c, -d commands. WWPN verbose information is extracted from the /etc/wwn file.
        -h    Help message. Exclusive with other options.
        -V    Display the version. Exclusive with other options.

    Online help and a man page give information about lsiocfg usage.

    2.4.4.1 HBA Inventory

    Using the lsiocfg HBA inventory option you can get basic information about Host Bus Adapters:

    • model
    • link (up or down)

    When getting HBA inventory in verbose mode mo
