Sun Fire X4500/X4540 Servers Diagnostics Guide
Contents
1.
    19 | OEM record e0 | 80000002000000000029000002
    1a | OEM record e0 | 80000004000000000000b00006
    1b | OEM record e0 | 80000048000000000011110322
    1c | OEM record e0 | 80000058000000000000030000
    1d | OEM record e0 | 800100440000000000fefff000
    1e | OEM record e0 | 80010048000000000000ff3efa
    1f | 09/25/2007 | 03:22:06 | System Boot Initiated #0x03 | Initiated by warm reset | Asserted
    20 | 09/25/2007 | 03:22:06 | Processor #0x04 | Presence detected | Asserted
    21 | 09/25/2007 | 03:22:15 | System Firmware Progress #0x01 | Memory initialization | Asserted
    22 | 09/25/2007 | 03:22:16 | Memory | Uncorrectable ECC | Asserted | CPU 0 DIMM 0
    23 | 09/25/2007 | 03:22:16 | Memory | Uncorrectable ECC | Asserted | CPU 1 DIMM 1
    24 | 09/25/2007 | 03:22:16 | Memory | Memory Device Disabled | Asserted | CPU 2 DIMM 0
    25 | 09/25/2007 | 03:22:16 | Memory | Memory Device Disabled | Asserted | CPU 2 DIMM 1

   The lines in the display start with event numbers (in hex), followed by a description of the event. TABLE 10-2 describes the contents of the display.

   TABLE 10-2 Lines in IPMI Output
   Event (hex)   Description
   8             A UCE caused a HyperTransport sync flood, which led to the system's warm reset. 0x02 refers to a reboot count maintained since the last AC power reset.
   9             The BIOS detected and initiated 4 processors in the system.
   a             The BIOS detected that a sync flood caused this reboot.
   b             The BIOS detected that a hardware error caused the sync flood.
   c to 1e       The BIOS retrieved and reported some hardware eviden
2. Sun Fire X4500 Server Front

   Summary
   Vendor    Model              Count
   HITACHI   HDS7225SBSUN250G   12
   AMI       Virtual CDROM      1
   AMI       Virtual Floppy     1
   TEAC      DV W516GA          1
   Total Storage Devices        15

   The following command displays the x64 platform type:

    hd p
    platform Sun Fire X4500 Server

   The following command displays the cXtY device name from the Solaris PCI storage device path:

    hd w /pci@3,0/pci1022,7458@a/pci11ab,11ab@1/disk@0,0
    c7t0 /pci@3,0/pci1022,7458@a/pci11ab,11ab@1/disk@0,0

   The following command displays the fdisk partition for each cXtY device name, with a summary:

    hd c s a
    platform Sun Fire X4500

   Here is an example of output listing the fdisk partition for each cXtY device name.

   TABLE 7-2 Output From hd Utility of an fdisk Partition Listing
   Device     Serial         Vendor    Model              Revision   Temperature   Type
   c0t4d0p0   K41BT4C7NXHS   HITACHI   HDS7225SBSUN250G   V440       None          Solaris2
   c5t0d0p0   K41BT4CGOPEE   HITACHI   HDS7225SBSUN250G   V440       None          Solaris2
   c5t4d0p0   K41BT4C7MULS   HITACHI   HDS7225SBSUN250G   V440       None          Solaris2
   c6t4d0p0   K41BT4CB6J5E   HITACHI   HDS7225SBSUN250G   V440       None          None
   c4t0d0p0   K41BT4CEMKHE   HITACHI   HDS7225SBSUN25
3. Select the IP Assignment option that you want to use (DHCP or Static).
   - If you choose DHCP, the server's IP address is retrieved from your network's DHCP server and displayed using the following format: Current IP address in BMC: XXX.XXX.XXX.XXX
   - If you choose Static to assign the IP address manually, perform the following steps:
     i. Type the IP address in the IP Address field. You can also enter the subnet mask and default gateway settings in their respective fields.
     ii. Select Commit and press Return to commit the changes.
     iii. Select Refresh and press Return to see your new settings displayed in the Current IP address in BMC field.
   6. Start a web browser and type the service processor's IP address in the browser's URL field.
   7. When you are prompted for a user name and password, type the following:
      - User Name: root
      - Password: changeme
      The Sun Integrated Lights Out Manager main GUI screen is displayed.
   8. Click the Remote Control tab.
   9. Click the Redirection tab.
   10. Set the color depth for the redirection console at either 6 or 8 bits.
   11. Click the Start Redirection button.
   12. When you are prompted for a user name and password, type the following:
      - User Name: root
      - Password: changeme
      The current POST screen is displayed.

   Sun Fire X4500/X4540 Servers Diagnostics Guide, July 2009

   Changing POST Options
   These i
4. To isolate and correct DIMM ECC errors:
   1. If you have not already done so, shut down your server to standby power mode and remove the cover.
   2. Inspect the installed DIMMs to ensure that they comply with the DIMM Population Rules on page 123.
   3. Press the PRESS TO SEE FAULT button and inspect the DIMM fault LEDs. See FIGURE 10-1. A flashing LED identifies a component with a fault.
      - For CEs, the LEDs correctly identify the DIMM where the errors were detected.
      - For UCEs, both LEDs in the pair flash if there is a problem with either DIMM in the pair.
      Note: If your server is equipped with a mezzanine board, the motherboard DIMMs and LEDs are hidden beneath it. However, the Motherboard Fault LED lights to indicate that there is a problem on the motherboard (only while AC power is still connected). If the Motherboard Fault LED on the mezzanine board lights, remove the mezzanine board as described in your server's service manual, and inspect the LEDs on the motherboard.
   4. Disconnect the AC power cords from the server.
      Caution: Before handling components, attach an ESD wrist strap to a chassis ground (any unpainted metal surface). The system's printed circuit boards and hard disk drives contain components that are extremely sensitive to static electricity.
      Note: To recover fault information, view the SP SEL. Refer to the Sun Integrated Lights Out Manager User's Guide.
5. [BIOS screen: Event Logging Details menu, ending with the Clear Event Log item. Key legend: Select Screen, Select Item, Enter Go to Sub Screen, F1 General Help, F10 Save and Exit, ESC Exit.]
   c. From the Event Logging Details screen, select View Event Log. All unread events are displayed.
   4. View the BMC system event log:
      a. From the BIOS Main Menu screen, select Advanced. The Advanced Settings screen is displayed.
      b. From the Advanced Settings screen, select IPMI 2.0 Configuration. The Advanced Menu IPMI 2.0 Configuration screen is displayed.

   [BIOS screen: Advanced, IPMI 2.0 Configuration]
   Status Of BMC: Working
   View BMC System Event Log
   Reload BMC System Event Log
   Clear BMC System Event Log
   LAN Configuration
   PEF Configuration
   BMC Watch Dog Timer Action: Disabled
   Help text: View all events in the BMC Event Log. It will take up to 60 seconds (approx.) to read all BMC SEL records.
   Key legend: Select Screen, Select Item, Enter Go to Sub Screen, F1 General Help, F10 Save and Exit, ESC Exit.
6. - Viewing the SEL With IPMItool on page 149
   - Clearing the SEL With IPMItool on page 151
   - Using the Sensor Data Repository (SDR) Cache on page 151
   - Sensor Numbers and Sensor Names in SEL Events on page 152

   Viewing the SEL With IPMItool
   There are two different IPMI commands that you can use to see different levels of detail.
   - View the ILOM SP SEL with a minimal level of detail by using the sel list command:

    ipmitool -I lanplus -H IPADDR -U root -P changeme sel list
    100 | Pre-Init Time-stamp | Entity Presence #0x16 | Device Absent
    200 | Pre-Init Time-stamp | Entity Presence #0x26 | Device Present
    300 | Pre-Init Time-stamp | Entity Presence #0x25 | Device Absent
    400 | Pre-Init Time-stamp | Phys Security #0x01 | Gen Chassis intrusion
    500 | Pre-Init Time-stamp | Entity Presence #0x12 | Device Present

     Note: When you use this command, an event record gives a sensor number but does not display the name of the sensor for the event. For example, in line 100 in the sample output above, the sensor number 0x16 is displayed. For information about how to map sensor names to the different sensor number formats that might be displayed, see Sensor Numbers and Sensor Names in SEL Events on page 152.
   - View the ILOM SP SEL with a detailed event output by using the sel elist command instead of sel list. The sel elist command cross referenc
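The sel list output described above can be post-processed with a short script. The following Python sketch is illustrative only: it assumes the pipe-delimited layout that ipmitool normally prints, with a hexadecimal record ID in the first field and the sensor number shown as #0xNN. The names SelEntry and parse_sel_list are hypothetical helpers, not part of IPMItool.

```python
import re
from typing import List, NamedTuple, Optional


class SelEntry(NamedTuple):
    record_id: int                 # record ID, parsed as hexadecimal
    timestamp: str                 # for example "Pre-Init Time-stamp"
    sensor_type: str               # for example "Entity Presence"
    sensor_number: Optional[int]   # for example 0x16, or None if absent
    description: str               # for example "Device Absent"


# Assumption: the sensor field looks like "Entity Presence #0x16".
_SENSOR_RE = re.compile(r"^(?P<type>.*?)\s*#0x(?P<num>[0-9a-fA-F]+)$")


def parse_sel_list(text: str) -> List[SelEntry]:
    """Parse pipe-delimited 'sel list' lines; skip lines that do not fit."""
    entries = []
    for line in text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) < 4:
            continue
        try:
            record_id = int(fields[0], 16)
        except ValueError:
            continue
        match = _SENSOR_RE.match(fields[2])
        if match:
            sensor_type = match.group("type")
            sensor_number = int(match.group("num"), 16)
        else:
            sensor_type, sensor_number = fields[2], None
        description = " | ".join(fields[3:])
        entries.append(
            SelEntry(record_id, fields[1], sensor_type, sensor_number, description)
        )
    return entries


SAMPLE = """\
 100 | Pre-Init Time-stamp | Entity Presence #0x16 | Device Absent
 400 | Pre-Init Time-stamp | Phys Security #0x01 | Gen Chassis intrusion
"""

entries = parse_sel_list(SAMPLE)
```

Because sel list reports only sensor numbers, a parsed sensor_number such as 0x16 still has to be matched against the sensor names reported by sdr elist.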
7. An alternate method of displaying the POST codes is to redirect the output of the console to a serial port (see Redirecting Console Output on page 163).
   This section includes the following topics:
   - BIOS POST Memory Test Overview on page 162
   - Redirecting Console Output on page 163
   - Changing POST Options on page 164
   - POST Codes on page 166
   - POST Code Checkpoints on page 167

   BIOS POST Memory Test Overview
   The BIOS POST memory test is performed as follows:
   1. The first megabyte of DRAM is tested by the BIOS before the BIOS code is shadowed (that is, copied from ROM to DRAM).
   2. Once executing out of DRAM, the BIOS performs a simple memory test (a write/read of every location with the pattern 55aa55aa).
      Note: This memory test is performed only if Quick Boot is not enabled from the Boot Settings Configuration screen. Enabling Quick Boot causes the BIOS to skip the memory test. See Changing POST Options on page 164 for more information.
      Note: Because the Sun Fire X4540 server can contain up to 64 GB of memory, the memory test can take several minutes. You can escape from POST testing by pressing any key during POST.
   3. The BIOS polls the memory controllers for both correctable and uncorrectable memory errors and logs those errors into the service processor

   Redirecting Console Output
   Use the following
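The write/read test with the pattern 55aa55aa can be pictured with a small simulation. The Python sketch below is not the BIOS implementation; it only illustrates the idea of writing a fixed pattern to every location and reading it back. SimulatedRam, its stuck_at_zero parameter, and pattern_test are hypothetical names invented for this example.

```python
from typing import Dict, List, Optional

PATTERN = bytes.fromhex("55aa55aa")  # the pattern named in the text


class SimulatedRam:
    """Toy memory model; selected bits can be stuck at 0 to mimic a fault."""

    def __init__(self, size: int, stuck_at_zero: Optional[Dict[int, int]] = None):
        self._cells = bytearray(size)
        # Maps a byte offset to a bit mask that always reads back as 0.
        self._stuck = stuck_at_zero or {}

    def __len__(self) -> int:
        return len(self._cells)

    def write(self, offset: int, data: bytes) -> None:
        for i, value in enumerate(data):
            mask = self._stuck.get(offset + i, 0)
            self._cells[offset + i] = value & ~mask & 0xFF

    def read(self, offset: int, length: int) -> bytes:
        return bytes(self._cells[offset:offset + length])


def pattern_test(ram: SimulatedRam) -> List[int]:
    """Write PATTERN to each 4-byte word, read it back, return bad offsets."""
    bad_offsets = []
    usable = len(ram) - len(ram) % len(PATTERN)
    for offset in range(0, usable, len(PATTERN)):
        ram.write(offset, PATTERN)
        if ram.read(offset, len(PATTERN)) != PATTERN:
            bad_offsets.append(offset)
    return bad_offsets


good = pattern_test(SimulatedRam(16))
# Byte 5 holds 0xaa from the pattern; forcing bit 1 to 0 corrupts that word.
faulty = pattern_test(SimulatedRam(16, stuck_at_zero={5: 0x02}))
```

An alternating pattern such as 55aa55aa exercises every bit in both states across neighboring bytes, which is why a single stuck bit shows up as a mismatch on read-back.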
8. [BIOS screen key legend: Enter Go to Sub Screen, F1 General Help]
   3. Select Boot Settings Configuration. The Boot Settings Configuration screen is displayed.

   [BIOS screen: Boot, Boot Settings Configuration]
   Quick Boot: Disabled
   System Configuration Display: Disabled
   Quiet Boot: Disabled
   Language: English
   AddOn ROM Display Mode: Force BIOS
   Bootup Num-Lock: On
   Wait For F1 If Error: Disabled
   Interrupt 19 Capture: Disabled
   Help text: Allows BIOS to skip certain tests while booting. This will decrease the time needed to boot the system.
   Key legend: Select Screen, Select Item, Change Option, F1 General Help, F10 Save and Exit, ESC Exit.

   4. On the Boot Settings Configuration screen, there are several options that you can enable or disable:
      - Quick Boot: This option is disabled by default. If you enable this, the BIOS skips certain tests while booting, such as the extensive memory test. This decreases
9. -P anonymous user list

   Changing the Default Password
   You can also change the default passwords for a particular user ID. First, get a list of users and find the ID for the user you wish to change. Then, supply it with a new password, as shown in the following command sequence:

    ipmitool -I lanplus -H IPADDR -U root -P changeme user list
    ID  Name  Callin  Link Auth  IPMI Msg  Channel Priv Limit
    1         false   false      true      NO ACCESS
    2   root  false   false      true      ADMINISTRATOR
    ipmitool -I lanplus -H IPADDR -U root -P changeme user set password 2 newpass
    ipmitool -I lanplus -H IPADDR -U root -P newpass chassis status

   Configuring an SSH Key
   You can use IPMItool to configure an SSH key for a remote shell user. To do this, first determine the user ID for the desired remote SP user with the user list command:

    ipmitool -I lanplus -H IPADDR -U root -P changeme user list

   Then supply the user ID and the location of the RSA or DSA public key to use with the ipmitool sunoem sshkey command. For example:

    ipmitool -I lanplus -H IPADDR -U root -P changeme sunoem sshkey set 2 id_rsa.pub
    Setting SSH key for user id 2 done

   You can also clear the key for a particular user, for example:

    ipmitool -I lanplus -H IPADDR -U root -P changeme sunoem sshkey del 2
    Deleted SSH key for user id 2

   Using IPMItool to Re
10. Index
    T
    time stamps, in ILOM SP SEL
    troubleshooting
        flow chart
        guidelines
    U
    uncorrectable errors, handling
11. The content of the selected log file is displayed in the window.
    c. With the three lower buttons, you can do the following actions:
       - Print the log file: A dialog box appears for you to specify your printer options and printer name.
       - Delete the log file: The file remains displayed, but will be gone the next time you try to display it.
       - Close the Log file window: The window is closed.
    Note: To save the log files, you must save them to another networked system or a removable media device. When you use the Bootable Diagnostics CD, the server boots from the CD. Therefore, the test log files are not on the server's hard disk drive, and they will be deleted when you power cycle the server.

    CHAPTER 3
    Using the ILOM Service Processor GUI to View System Information
    This chapter contains information about using the Integrated Lights Out Manager (ILOM) service processor (SP) GUI to view monitoring and maintenance information for your server. It includes the following sections:
    - Making a Serial Connection to the SP on page 20
    - Viewing ILOM SP Event Logs on page 21
    - Viewing Replaceable Component Information on page 24
    - Viewing Temperature, Voltage, and Fan Sensor Readings on page 26
    For more information on using the ILOM SP GUI to maintain the server (for example, configuring alerts), refer to the Integrated Lights Ou
12. s power supplies. FIGURE 8-3 shows AC power cords on the rear panel.
    2. Check that the server covers, including the hard disk drive access cover, system controller cover, and fan access cover, are firmly in place. Refer to the cover labels. An intrusion switch on the system controller shuts the server down when the hard disk drive access cover is removed.
    3. Investigate the conditions that can trigger an automatic shutdown sequence. A power-off sequence is initiated by a request from either of the following items:
       - Board management controller (BMC). The conditions that trigger the BMC to issue a shutdown request are:
         - An over-temperature condition for more than 1 second
         - Multiple fan failures
       - Fault condition. The fault conditions that trigger a shutdown are:
         - All power supplies have failed or have been removed.
         - A power supply has been out of spec for more than 100 ms.
         - The hot-swap circuit has faulted.
         - An over-temperature condition has occurred.
       Note: Any power supply that is out of spec causes a reset, but only power supplies that remain out of spec for more than 100 ms cause a shutdown.

    External Inspection of the Server
    Improperly set controls and loose or improperly connected cables are common causes of problems with hardware components.
    To perform a visual inspection of the external system:
    1. Inspect the front panel LEDs for indications of component malfunction. FIGURE 8-2 shows the front panel controls
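The shutdown conditions listed above can be summarized as a small decision function. This Python sketch is a reading aid under stated assumptions, not the firmware logic: "multiple fan failures" is taken to mean two or more, the hot-swap circuit condition is left out, and the names PowerSupply, power_supply_action, and bmc_requests_shutdown are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable

OUT_OF_SPEC_LIMIT_MS = 100.0   # from the text: more than 100 ms causes a shutdown
OVER_TEMP_LIMIT_S = 1.0        # from the text: over temperature for more than 1 second
MULTIPLE_FANS = 2              # assumption: "multiple" means two or more


@dataclass
class PowerSupply:
    present: bool = True
    failed: bool = False
    out_of_spec_ms: float = 0.0   # how long the supply has been out of spec


def power_supply_action(supplies: Iterable[PowerSupply]) -> str:
    """Return 'shutdown', 'reset', or 'none' for the power supply conditions."""
    usable = [p for p in supplies if p.present and not p.failed]
    if not usable:
        # All power supplies have failed or have been removed.
        return "shutdown"
    worst = max(p.out_of_spec_ms for p in usable)
    if worst > OUT_OF_SPEC_LIMIT_MS:
        return "shutdown"
    if worst > 0:
        # Out of spec, but not for long enough to force a shutdown.
        return "reset"
    return "none"


def bmc_requests_shutdown(over_temp_seconds: float, failed_fans: int) -> bool:
    """Return True when the BMC conditions from the list above are met."""
    return over_temp_seconds > OVER_TEMP_LIMIT_S or failed_fans >= MULTIPLE_FANS
```

For example, a supply that dips out of spec for 20 ms maps to a reset, while one that stays out of spec for 150 ms maps to a shutdown.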
13. Number   Name                          Color   Description
    2        Locate button/LED             White   Operators can turn this LED on remotely to help them locate the server in a crowded server room. Press to turn off.
    3        Fault LED                     Amber   When on, service action is required.
    4        OK LED                        Green   Steady: power is on. Off: power is off. Blink: standby power is on, but main power is off.
    5        System controller status LEDs         Blue: ready to remove, service action allowed. Amber: fault, service action required. Green: operational, no action required.

    For additional LED locations and descriptions, see Identifying Status and Fault LEDs on page 171.
    3. Verify that nothing in the server environment is blocking air flow or making a contact that could short out power.
    4. If the problem is not evident, continue with the next section, Internal Inspection of the Server on page 115.

    Internal Inspection of the Server
    To perform a visual inspection inside the server:
    1. Shut down the server from main power to standby power mode. Choose one of the following methods, using a non-conducting ballpoint pen or stylus. See FIGURE 8-4.
       - Graceful shutdown: Press and release the Power button on the front panel. Pressing the Power button causes Advanced Configuration and Power Interface (ACPI) enabled operating systems to perform an orderly shutdown of the operating system. Servers not running ACPI
14. About IPMItool
    IPMItool Man Page
    Connecting to the Server With IPMItool
    Enabling the Anonymous User
    Changing the Default Password
    Configuring an SSH Key
    Using IPMItool to Read Sensors
    Reading Sensor Status
    Reading All Sensors
    Reading Specific Sensors
    Using IPMItool to View the ILOM SP System Event Log
    Viewing the SEL With IPMItool
    Clearing the SEL With IPMItool
    Using the Sensor Data Repository (SDR) Cache
    Sensor Numbers and Sensor Names in SEL Events
    Viewing Component Information With IPMItool
    Viewing and Setting Status LEDs
    LED Sensor IDs
    LED Modes
    LED Sensor Groups
    Using IPMItool Scripts For Testing
    Event Logs and POST Codes
    Viewing Event Logs
    Power-On Self-Test (POST)
    How BIOS POST Memory Testing Works
    Redirecting Console Output
    Changing POST Options
    To Change POST Options
    POST Codes
    POST Code Checkpoints
    Status Indicator LEDs
    External Status Indicator LEDs
    Exterior Features, Controls, and Indicators
    Front Panel
    Rear Panel
    Internal Status Indicator LEDs
    Disk Drive and Fan Tray LEDs
    CPU Board LEDs
    hd Utility
    Overview of the hd Utility
    Using the hd Utility
    hd Utility Mapping
    hd Command Options and Parameters
    hd Man page
    Options Parameters
    Example Using the hd Utility
    Sun Fire X4500 Disk Mapping
    Pre ILOM 2 0 2 5 an
15. Accessing SunVTS
    SunVTS software has a graphical user interface (GUI) that provides test configuration and status monitoring. The user interface can be run on one system to display the SunVTS testing of another system on the network. SunVTS software also provides a TTY-mode interface for situations in which running a GUI is not possible.

    SunVTS Documentation
    For the most up-to-date SunVTS documentation, go to: http://www.sun.com/oem/products/vts

    Running SunVTS Diagnostic Tests Using the Bootable Diagnostics CD
    Use the Bootable Diagnostics CD to diagnose server problems. This CD is designed so that the server will boot from the CD. This CD boots the Solaris operating system and starts SunVTS software. Diagnostic tests run and write output to log files that the service technician can use to determine the problem with the server.
    SunVTS 7.0 or later software is preinstalled on these Sun Fire X4540 servers. The server is also shipped with the Sun Fire X4540 Server Bootable Diagnostics CD (part number 705-1439).

    SunVTS Log Files
    SunVTS provides access to four different log files:
    - SunVTS test error log contains time-stamped SunVTS test error messages. The log file path name is /var/sunvts/logs/sunvts.err. This file is not created until a SunVTS test failure occurs.
    - SunVTS kernel error log contains time-stamped SunVTS kernel and SunVTS probe errors. SunVT
16. IPADDR -U root -P changeme sel get 0x0a00
    SEL Record ID          : 0a00
     Record Type           : 02
     Timestamp             : 07/06/1970 01:53:58
     Generator ID          : 0020
     EvM Revision          : 04
     Sensor Type           : Entity Presence
     Sensor Number         : 12
     Event Type            : Generic Discrete
     Event Direction       : Assertion Event
     Event Data (RAW)      : 01ffff
     Description           : Device Present

    Sensor ID              : ps0.prsnt (0x12)
     Entity ID             : 10.0
     Sensor Type (Discrete): Entity Presence
     States Asserted       : Availability State
                             [Device Present]

    In the example above, this particular event describes that Power Supply 0 is detected and present.

    Clearing the SEL With IPMItool
    To clear the SEL, type the sel clear command:

    ipmitool -I lanplus -H IPADDR -U root -P changeme sel clear
    Clearing SEL. Please allow a few seconds to erase.

    Using the Sensor Data Repository (SDR) Cache
    When working with the ILOM SP, certain operations can be expensive in terms of execution time and the amount of data transferred. Typically, issuing the sdr elist command requires the entire SDR to be read from the SP. Similarly, the sel elist command needs to read both the SDR and the SEL from the SP in order to cross-reference events and display useful information.
    To speed up these operations, it is possible to pre-cache the static data in the SDR and feed it back into IPMItool. This can have a dramatic effect on the processing time for some commands.
17. In order to generate an SDR cache for later use, type the sdr dump command. For example:

    ipmitool -I lanplus -H IPADDR -U root -P changeme sdr dump galaxy.sdr
    Dumping Sensor Data Repository to 'galaxy.sdr'

    After you have generated a cache file, it can be supplied to future invocations of IPMItool with the -S option. For example:

    ipmitool -I lanplus -H IPADDR -U root -P changeme -S galaxy.sdr sel elist
    100 | Pre-Init Time-stamp | Entity Presence ps1.prsnt | Device Absent
    200 | Pre-Init Time-stamp | Entity Presence io.f0.prsnt | Device Absent
    300 | Pre-Init Time-stamp | Power Supply ps0.vinok | State Asserted

    Sensor Numbers and Sensor Names in SEL Events
    Depending on which IPMI command you use, the sensor number that is displayed for an event might appear in slightly different formats. See the following examples:
    - The sensor number for the sensor ps1.prsnt (power supply 1 present) can be displayed as either 1Fh or 0x1F.
    - 38h is equivalent to 0x38.
    - 4Bh is equivalent to 0x4B.
    The output from certain commands might not display the sensor name along with the corresponding sensor number. To see all sensor names in your server mapped to the corresponding sensor numbers, you can use the following command:

    ipmitool -H 129.144.82.21 -U root -P changeme sdr elist
    sys.id           | 00h | ok  | 23.0 | State Asserted
    sys.intsw        | 01h | ok  | 23.0 |
    sys.psfail       | 02h | ok  | 23
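The two sensor number notations described above (a trailing h, as in 1Fh, and a leading 0x, as in 0x1F) can be normalized before events are compared against sensor names. The following Python sketch shows one way to do that; sensor_number is a hypothetical helper, and accepting a leading # (as in #0x16) is an added convenience that the text does not describe.

```python
def sensor_number(text: str) -> int:
    """Convert '1Fh', '0x1F', or '#0x1F' to the same integer value."""
    token = text.strip().lstrip("#")
    lowered = token.lower()
    if lowered.startswith("0x"):
        return int(token, 16)        # int() accepts the 0x prefix with base 16
    if lowered.endswith("h"):
        return int(token[:-1], 16)   # drop the trailing h, parse the rest as hex
    raise ValueError(f"unrecognized sensor number format: {text!r}")


def same_sensor(a: str, b: str) -> bool:
    """Return True when two notations refer to the same sensor number."""
    return sensor_number(a) == sensor_number(b)
```

With this helper, same_sensor("1Fh", "0x1F") is true, which matches the equivalence stated for ps1.prsnt.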
18. TABLE 6-2 lists the internal LEDs.

    TABLE 6-2 Internal LEDs
    Name               Color   Function
    Disk Drives
      Status           Green   Blinking: data is transferring; unit is OK.
      Fault            Amber   Fault, service action is required.
      Ready to Remove  Blue    Unit is ready to remove. Service action allowed.
    Fan Trays
      Status           Green   Unit is OK.
      Fault            Amber   Fault, service action is required.
    CPU (see FIGURE 6-6; LEDs are active only when the Remind button is pressed)
      DIMM Failure     Amber   Blinks to indicate that the system has found a fault with the DIMM. Restart system to clear fault.
      CPU Failure      Amber   Blinks to indicate that the system has found a fault with a CPU. Restart system to clear fault.
      Battery Failure  Amber   Blinks to indicate that the system has found a fault with the battery. Start service processor to clear fault.

    Disk Drive and Fan Tray LEDs
    FIGURE 6-4 shows the location of the internal LEDs. FIGURE 6-5 shows a close-up view of the disk drive and fan trays, including the symbols that identify the LEDs.

    FIGURE 6-4 Disk Drive and Fan Tray LEDs
    FIGURE 6-5 Disk Drive and Fan Tray LEDs [callouts: Ready to Remove (service action allowed), Fault (service action required), OK]

    CPU Board LEDs
    The CPU board has three types of LEDs. They are listed in TABLE 6-3 and appear in FIGURE 6-6.
19. 1 for a page that shows sample information.

    FIGURE 3-1 System Event Logs Page
    [Screen capture: Sun Integrated Lights Out Manager, System Monitoring tab, Event Logs page. View sensor-specific, BIOS-generated, or system management software event logs. Selected event log category: Sensor Specific Events. Event Log (4 event entries):]
    Event ID   Time Stamp            Sensor Name   Sensor Type       Description
    4          12/31/1969 16:01:01   ps1.vinok     Power Supply      State Asserted: Asserted
    3          12/31/1969 16:01:01   ps0.prsnt     Entity Presence   Device Removed/Device Absent: Asserted
    2          12/31/1969 16:00:57   ps1.prsnt     Entity Presence   Device Inserted/Device Present: Asserted
    1          12/31/1969 16:00:56   ps1.pwrok     Power Supply      State Deasserted: Asserted

    3. Select a category of event that you want to view in the log from the drop-down menu. You can select from the following types of events:
       - Sensor-specific events. These events relate to a specific sensor for a component, for example, a fan sensor or a power supply sensor.
       - BIOS-generated events. These events relate to error messages generated in the BIOS.
       - System management software events. These events relate to events that occu
20. 2 Failed
    p0.d3.led     CPU 0 DIMM 3 Failed
    p1.led        CPU 1 Failed
    p1.d0.led     CPU 1 DIMM 0 Failed
    p1.d1.led     CPU 1 DIMM 1 Failed
    p1.d2.led     CPU 1 DIMM 2 Failed
    p1.d3.led     CPU 1 DIMM 3 Failed
    ft0.fm0.led   Fan Tray 0 Module 0 Failed
    ft0.fm1.led   Fan Tray 0 Module 1 Failed
    ft0.fm2.led   Fan Tray 0 Module 2 Failed
    ft1.fm0.led   Fan Tray 1 Module 0 Failed
    ft1.fm1.led   Fan Tray 1 Module 1 Failed
    ft1.fm2.led   Fan Tray 1 Module 2 Failed

    LED Modes
    You supply the modes in TABLE 12-3 to the led set commands to specify the mode in which you want the LED to be placed.

    TABLE 12-3 LED Modes
    Mode      Description
    OFF       LED off
    ON        LED steady on
    STANDBY   100 ms on, 2900 ms off
    SLOW      1 Hz blink rate
    FAST      4 Hz blink rate

    LED Sensor Groups
    Because each LED has its own sensor and can be controlled independently, there is some overlap in sensors. In particular, there are separate LEDs defined for the power, locate, and alert LEDs on the front and back panels. It is desirable to have these sensors linked so that both the front and back panel LEDs can be controlled at the same time. This is handled through the use of Entity Association Records. These are records in the SDR that contain a list of entities that are considered part of a group. For each Entity Association Record, we also define another Generic Device Locator as a logical entity to indicate to system software that it refers to a group of LED
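The blink timings in TABLE 12-3 can be expressed as on/off intervals. The Python sketch below models them for illustration. The STANDBY figures come from the table; the 50 percent duty cycle used for SLOW and FAST is an assumption, because the table gives only the blink rate. LED_MODES and led_is_lit are hypothetical names.

```python
# Maps each mode to (on_ms, period_ms). A period of 0 means no blinking.
LED_MODES = {
    "OFF": (0, 0),
    "ON": (1, 0),
    "STANDBY": (100, 3000),   # 100 ms on, 2900 ms off (from the table)
    "SLOW": (500, 1000),      # 1 Hz; 50% duty cycle is an assumption
    "FAST": (125, 250),       # 4 Hz; 50% duty cycle is an assumption
}


def led_is_lit(mode: str, elapsed_ms: int) -> bool:
    """Return True if an LED in the given mode is lit at elapsed_ms."""
    on_ms, period_ms = LED_MODES[mode.upper()]
    if period_ms == 0:
        return on_ms > 0                     # steady on or steady off
    return (elapsed_ms % period_ms) < on_ms  # lit during the first on_ms of each period
```

In STANDBY mode, for example, the LED is lit for the first 100 ms of every 3000 ms period and dark for the remaining 2900 ms.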
21. CPU is installed in a system controller with an outdated BIOS. In this case, the BIOS must be updated.
    Logged: DMI Log
    Fatal?: Non-fatal

    Error: BIOS POST CMOS Checksum Bad
    Description: CMOS contents failed the checksum check.
    Handling: The BIOS displays an error message, logs the error to DMI, and boots.
    Logged: DMI Log
    Fatal?: Non-fatal

    Error: Unsupported CPU configuration
    Description: The BIOS supports mismatched frequency and steppings in CPU configuration, but some CPUs might not be supported.
    Handling: The BIOS displays an error message, logs the error, and halts the system.
    Logged: DMI Log
    Fatal?: Fatal

    Error: Correctable error
    Description: The CPU detects a variety of correctable errors in the MCi_STATUS registers.
    Handling: The CPU corrects the error in hardware. No interrupt or machine check is generated by the hardware. The polling is triggered every half second by SMI timer interrupts, and is done by the BIOS SMI handler. The SMI handler logs a message to the SP SEL if the SEL is available; otherwise, SMI logs a message to DMI. The BIOS's polling can be disabled through software SMI.
    Logged: DMI Log, SP SEL
    Fatal?: Normal operation

    Error: Single fan failure
    Description: Fan failure is detected by reading tach signals.
    Handling: The Front Fan Fault, Service Action Required, and individual fan module LEDs are lit.
    Logged: SP SEL
    Fatal?: Non-fatal

    TABLE B-1 Hardware Error Handling Summary (Continued)
    Columns: Error, Description, Handling, Logged (DMI Log or SP SEL), Fatal?

    Error: Multiple fan
    Description: Fan failure is
    Handling: The Front Fan Fault
22. p1 v 1v25core    | 37h | ok  |  3.1 | 1.32 Volts
    ft0.fm0.f0.speed | 43h | ok  | 29.0 | 6000 RPM
    ft0.fm1.f0.speed | 44h | ok  | 29.1 | 6000 RPM
    ft0.fm2.f0.speed | 45h | ok  | 29.2 | 6000 RPM
    ft1.fm0.f0.speed | 46h | ok  | 29.3 | 6000 RPM
    ft1.fm1.f0.speed | 47h | ok  | 29.4 | 6000 RPM
    ft1.fm2.f0.speed | 48h | ok  | 29.5 | 6000 RPM

    You can also generate a list of all sensors for a specific entity. Use the list output to determine which entity you are interested in seeing, then use the sdr entity command to get a list of all sensors for that entity. This command accepts an entity ID and an optional entity instance argument. If an entity instance is not specified, it will display all instances of that entity.
    The entity ID is given in the fourth field of the output, as read from left to right. For example, in the output shown in the previous example, all the fans are entity 29. The last fan listed (29.5) is entity 29, with instance 5:

    ft1.fm2.f0.speed | 48h | ok  | 29.5 | 6000 RPM

    For example, to see all fan-related sensors, type the following command with the entity 29 argument:

    ipmitool -I lanplus -H IPADDR -U root -P changeme sdr entity 29
    ft0.fm0.fail     | 3Dh | ok  | 29.0 | Predictive Failure Deasserted
    ft0.fm0.led      | 00h | ns  | 29.0 | Generic Device @20h:19h.0
    ft0.fm1.fail     | 3Eh | ok  | 29.1 | Predictive Failure Deasserted
    ft0.fm1.led      | 00h | ns  | 29.1 | Generic Device @20h:19h.1
    ft0.fm2.fail     | 3Fh | ok  | 29.2 | Pre
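The entity field described above (entity ID, a dot, then the instance) can also be filtered in a script when the output of sdr elist has been saved. This Python sketch assumes the five-field, pipe-delimited layout that ipmitool normally prints. SdrRow, parse_sdr_elist, and sensors_for_entity are hypothetical names, and the sample rows reuse values quoted in this guide.

```python
from typing import List, NamedTuple, Optional


class SdrRow(NamedTuple):
    name: str        # sensor name, for example ft1.fm2.f0.speed
    number: int      # sensor number, for example 0x48
    status: str      # for example ok or ns
    entity_id: int   # for example 29
    instance: int    # for example 5
    reading: str     # for example "6000 RPM"


def parse_sdr_elist(text: str) -> List[SdrRow]:
    """Parse 'name | NNh | status | id.instance | reading' lines."""
    rows = []
    for line in text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) != 5:
            continue
        name, number, status, entity, reading = fields
        entity_id, _, instance = entity.partition(".")
        try:
            rows.append(SdrRow(name, int(number.rstrip("hH"), 16), status,
                               int(entity_id), int(instance), reading))
        except ValueError:
            continue  # skip truncated or malformed lines
    return rows


def sensors_for_entity(rows: List[SdrRow], entity_id: int,
                       instance: Optional[int] = None) -> List[str]:
    """Return sensor names for an entity, optionally for one instance only."""
    return [r.name for r in rows
            if r.entity_id == entity_id
            and (instance is None or r.instance == instance)]


SAMPLE = """\
ft0.fm0.f0.speed | 43h | ok  | 29.0 | 6000 RPM
ft1.fm2.f0.speed | 48h | ok  | 29.5 | 6000 RPM
ps0.prsnt        | 12h | ok  | 10.0 | Device Present
"""

rows = parse_sdr_elist(SAMPLE)
fans = sensors_for_entity(rows, 29)
last_fan = sensors_for_entity(rows, 29, 5)
```

This mirrors what sdr entity 29 does on the service processor: omitting the instance returns every instance of the entity.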
23. HyperTransport links. The system reboots; the BIOS recovers the machine check register information, maps this information to the failing DIMM (when CHIPKILL is disabled) or DIMM pair (when CHIPKILL is enabled), and logs that information to the SP. The BIOS will halt the CPU.

    Error: Unsupported DIMM configuration
    Description: Unsupported DIMMs are used, or supported DIMMs are loaded improperly.
    Handling: The BIOS displays an error message, logs an error, and halts the system.
    Logged: DMI Log, SP SEL
    Fatal?: Fatal

    Error: HyperTransport link failure
    Description: CRC or link error on one of the HyperTransport links.
    Handling: Sync floods on HyperTransport links, the machine resets itself, and error information gets retained through reset. The BIOS reports: A Hyper Transport sync flood error occurred on last boot, press F1 to continue.
    Logged: DMI Log, SP SEL
    Fatal?: Fatal

    TABLE B-1 Hardware Error Handling Summary (Continued)
    Columns: Error, Description, Handling, Logged (DMI Log or SP SEL), Fatal?

    Error: PCI SERR, PERR
    Description: System or parity error on a PCI bus.
    Handling: Sync floods on HyperTransport links, the machine resets itself, and error information gets retained through reset. The BIOS reports: A Hyper Transport sync flood error occurred on last boot, press F1 to continue.
    Logged: DMI Log, SP SEL
    Fatal?: Fatal

    Error: BIOS POST Microcode Error
    Description: The BIOS could not find or load the CPU Microcode Update to the CPU. The message most likely appears when a new
    Handling: The BIOS displays an error message, logs the error to DMI, and boots.
24. Network Circle, Santa Clara, California 95054, United States. All rights reserved.
    This distribution may include materials developed by third parties.
    Sun, Sun Microsystems, the Sun logo, Java, Netra, Solaris, Sun Ray, and Sun Fire X4540 Backup Server are trademarks or registered trademarks of Sun Microsystems, Inc. or its subsidiaries in the United States and other countries.
    This product is subject to U.S. export control laws and may be subject to the export or import regulations in force in other countries. End uses or end users for nuclear weapons, missiles, chemical and biological weapons, or nuclear maritime purposes, whether direct or indirect, are strictly prohibited. Exports or re-exports to countries subject to U.S. embargo, or to entities identified on U.S. export exclusion lists, including but not limited to the list of persons subject to an order not to participate, directly or indirectly, in exports of products or services governed by U.S. export control laws, and the list of specially designated nationals, are strictly prohibited.
    The use of spare parts or replacement central units is limited to repairs or to the standard exchange of cent
25. Refresh button to update the sensor readings to their current status.
    5. Click the Show Thresholds button to display the settings that trigger alerts. The Sensor Readings table is updated. See the example in FIGURE 3-4. For example, if system temperature reaches 30 degrees C, the service processor will send an alert.
       Sensor thresholds include the following:
       - Low/High NR: Low or high non-recoverable
       - Low/High CR: Low or high critical
       - Low/High NC: Low or high non-critical

    FIGURE 3-4 Sensor Readings Page With Thresholds Displayed
    [Screen capture: Sensor Readings page, sensor type category All Sensors, 77 sensors. Visible readings include sys.tempfail (Predictive Failure Deasserted), sys.fanfail (Predictive Failure Deasserted), mb.t_amb (Normal, 24 degrees C), and mb.v_bat (Normal, 3.232 Volts), followed by the low and high threshold columns.]
26. - The BIOS logs to DMI.
- The BIOS logs to the SP SEL through the BMC.
- The feature is turned off at OS boot time by default.
- Solaris support provides full self-healing and automated diagnosis for the CPU and memory subsystems.
- FIGURE D-2 shows an example of a DMI log screen from the BIOS Setup page.
FIGURE D-2 DMI Log Screen, Correctable Error
[Screenshot: BIOS Setup Utility, View Event Log. Entry dated 09/12/05 12:33:16: ECC on Node 1, DIMM Pair 0, with SPD addresses; Single-Bit ECC Memory Error.]
- If, during any stage of memory testing, the BIOS finds itself incapable of reading or writing to the DIMM, it takes the following actions:
  - The BIOS disables the DIMM, as indicated by the Memory Decreased message in the example in FIGURE D-3.
  - The BIOS logs an SEL record.
  - The BIOS logs an event in DMI.
FIGURE D-3 DMI Log Screen, Correctable Error, Memory Decreased
[Screenshot: BIOS Setup Utility event log. Entries: 09/12/05 13:30:00 Memory decreased; 09/12/05 13:29:54 ECC on Node 1, DIMM Pair 0, with SPD addresses; 09/12/05 13:29:54 Memory Error, Single-Bit ECC.]
Parity Errors (PERR)
This section lists facts and considerations about how the server handles parity errors (PERR).
- The handling of parity errors works through NMIs.
- During BIOS POST
27. and password, type the following:
User Name: root
Password: changeme
The Sun Integrated Lights Out Manager main GUI screen is displayed.
- Click the Remote Control tab.
- Click the Redirection tab.
- Set the color depth for the redirection console at either 6 or 8 bits.
- Click the Start Redirection button.
- When you are prompted for a user name and password, type the following:
User Name: root
Password: changeme
The current POST screen is displayed.
Changing POST Options
These instructions are optional, but you can use them to change the operations that the server performs during POST testing.
To Change POST Options
1. Initialize the BIOS Setup utility by pressing the F2 key while the system is performing the power-on self-test (POST). The BIOS Main menu screen is displayed.
2. Select Boot. The Boot Settings screen is displayed.
[Screenshot: BIOS Setup Utility with the Main, Advanced, PCIPnP, Boot, Security, Chipset, and Exit menus. The Boot Settings screen lists Boot Settings Configuration, Boot Device Priority, and Hard Disk Drives.]
28. causes an immediate reboot of the system.
2. During reboot, the BIOS checks the Machine Check registers and determines that the previous reboot was due to a UCE, then reports this message in POST after the memtest stage:
A Hypertransport Sync Flood occurred on last boot
3. The BIOS reports this event in the service processor's system event log (SEL), as shown in the sample IPMItool output below:
ipmitool -H 10.6.77.249 -U root -P changeme -I lanplus sel list
   8 | 09/25/2007 | 03:22:03 | System Boot Initiated #0x02 | Initiated by warm reset | Asserted
   9 | 09/25/2007 | 03:22:03 | Processor #0x04 | Presence detected | Asserted
   a | 09/25/2007 | 03:22:03 | OEM #0x12 | | Asserted
   b | 09/25/2007 | 03:22:03 | System Event #0x12 | Undetermined system hardware failure | Asserted
   c | OEM record e0 | 00000002000000000029000002
   d | OEM record e0 | 00000004000000000000b00006
   e | OEM record e0 | 00000048000000000011110322
   f | OEM record e0 | 00000058000000000000030000
  10 | OEM record e0 | 000100440000000000fef 000
  11 | OEM record e0 | 00010048000000000000ff3efa
  12 | OEM record e0 | 10ab0000000010000006040012
  13 | OEM record e0 | 10ab0000001111002011110020
  14 | OEM record e0 | 0018304c00f200002000020c0f
  15 | OEM record e0 | 0019304c00f200004000020c0f
  16 | OEM record e0 | 001a304c00f45aa10015080a13
  17 | OEM record e0 | 001a3054000000000320004880
  18 | OEM record e0 | 001b304c00 200001000020c0F
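The SEL listing above can also be scanned by a script when many records are present. The following is a minimal sketch, not part of the guide: it assumes the pipe-delimited layout that `ipmitool sel list` normally prints, and the function name `find_uce_dimms` is hypothetical. The sample rows follow the Memory events that the same boot sequence logs later in the listing.

```python
# Sketch only: assumes pipe-delimited `ipmitool sel list` output.
SAMPLE_SEL = """\
  21 | 09/25/2007 | 03:22:15 | System Firmware Progress #0x01 | Memory initialization | Asserted
  22 | 09/25/2007 | 03:22:16 | Memory | Uncorrectable ECC | Asserted | CPU 0 DIMM 0
  23 | 09/25/2007 | 03:22:16 | Memory | Uncorrectable ECC | Asserted | CPU 1 DIMM 1
  24 | 09/25/2007 | 03:22:16 | Memory | Memory Device Disabled | Asserted | CPU 2 DIMM 0
"""


def find_uce_dimms(sel_text):
    """Return the DIMM locations named in 'Uncorrectable ECC' records."""
    dimms = []
    for line in sel_text.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if "Uncorrectable ECC" not in fields:
            continue
        # The location (for example "CPU 0 DIMM 0") is the last field,
        # provided the record carries anything after the event name.
        if fields.index("Uncorrectable ECC") < len(fields) - 1:
            location = fields[-1]
            if location not in dimms:
                dimms.append(location)
    return dimms


if __name__ == "__main__":
    for dimm in find_uce_dimms(SAMPLE_SEL):
        print("Uncorrectable ECC reported for", dimm)
```

OEM records are skipped because they carry raw hardware evidence rather than a decoded DIMM location.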
29. devices present. For example, three USB DVD devices get the c2, c3, and c4 names at the time of OS installation.
TABLE 7-8 Sun Fire X4500 Disk Mapping, Three USB Storage Devices (ILOM 2.0.2.5)
Column order: USB CD-ROM, USB Floppy, USB Device, Controller 3, Controller 2, Controller 5, Controller 4, Controller 1, Controller 0
Slots 36-47:  c6t3 c6t7 c5t3 c5t7 c8t3 c8t7 c7t3 c7t7 c1t3 c1t7 c0t3 c0t7
Slots 24-35:  c6t2 c6t6 c5t2 c5t6 c8t2 c8t6 c7t2 c7t6 c1t2 c1t6 c0t2 c0t6
Slots 12-23:  c6t1 c6t5 c5t1 c5t5 c8t1 c8t5 c7t1 c7t5 c1t1 c1t5 c0t1 c0t5
Slots 0-11:   c6t0 c6t4 c5t0 c5t4 c8t0 c8t4 c7t0 c7t4 c1t0 c1t4 c0t0 c0t4
USB devices:  c2ty* c3ty* c4ty*
Controller names: c2* c3* c4* c6 c5 c8 c7 c1 c0
* Asterisk denotes the device names c2, c3, and c4 mapped to the three USB devices on the system.
APPENDIX A Sun Fire X4500 Sensor Locations
This appendix lists the locations of the sensors of the Sun Fire X4500 server.
TABLE A-1
Name of Sensor    Location of Sensor
dbp t_amb         Disk backplane
ft0 prsnt         Fan board
ft0 f0 speed      Fan board
ft0 f1 speed      Fan board
ft1 prsnt         Fan board
ft1 f0 speed      Fan board
ft1 f1 speed      Fan board
ft2 prsnt         Fan board
ft2 f0 speed      Fan board
ft2 f1 speed      Fan board
ft3 prsnt         Fan board
ft3 f0 speed      Fan board
ft3
30. fan_on.isc, you would use it in a command as follows:
ipmitool -I lanplus -H IPADDR -U root -P changeme exec leds_fan_on.isc
CHAPTER 13 Event Logs and POST Codes
This chapter contains information about the BIOS event log, the BMC system event log, the power-on self-test (POST), and console redirection. For more information on the BIOS event log and POST codes, refer to the Sun Fire X4540 Server Service Manual, 819-4359.
This chapter includes the following topics:
- Viewing Event Logs
- About Power-On Self-Test (POST)
- BIOS POST Memory Test Overview
- Redirecting Console Output
- Changing POST Options
- POST Codes
- POST Code Checkpoints
Viewing Event Logs
To view the BIOS event log and the BMC system event log:
- Turn on main power so that all components are powered on. Use a non-conducting ball-point pen or stylus to press and release the Power button on the server front panel. See FIGURE 8-4. When main power is applied to the full server, the Power OK LED next to the Power button lights and remains lit.
- Enter the BIOS Setup utility by pressing the F2 key while the system is performing the power-on self-test (POST). The BIOS Main menu screen is displa
31. TABLE A-1 (continued)
Name of Sensor        Location of Sensor
f1 speed              Fan board
ft4 prsnt             Fan board
ft4 f0 speed          Fan board
ft4 f1 speed          Fan board
ps0 prsnt             Power supply
ps0 pwrok             Power supply
ps0 vinok             Power supply
ps1 prsnt             Power supply
ps1 pwrok             Power supply
ps1 vinok             Power supply
ps2 prsnt             Power supply
ps2 pwrok             Power supply
ps2 vinok             Power supply
io front t_amb        IO controller board
io rear t_amb         IO controller board
io v_ 1v5             IO controller board
io v_ 2v5             IO controller board
io v_ 5v_disk         IO controller board
io v_ 12v             IO controller board
proc p0 t_core        CPU board
proc p1 t_core        CPU board
proc front t_amb      CPU board
proc rear t_amb       CPU board
proc p0 v_ 1v25       CPU board
proc p0 v_ 1v5        CPU board
proc p0 v_ 2v5        CPU board
proc p1 v_ 1v25       CPU board
proc p1 v_ 1v5        CPU board
proc p1 v_ 2v5        CPU board
proc v_ 1v8           CPU board
sys v_ 12v            CPU board
sys v_ 1v2            CPU board
sys v_ 3v3            CPU board
sys v_ 3v3stby        CPU board
sys v_ 5v             CPU board
sys v_bat             CPU board
bp locate btn         Not a sensor, but a locate button. On rear backplane.
fp prsnt              Disk backplane
fp locate btn         Not a sensor, but a locate button. On rear backplane.
hdd x state           Software sensors. No corresponding hardware. Status of the sensors are set by app running on
sys intsw
sys acpi
32. Handling Mismatching Processors
This section lists facts and considerations about how the server handles mismatching processors.
- The BIOS performs a complete POST.
- The BIOS displays a report of any mismatching CPUs, as shown in the following example:
AMIBIOS (C) 2003 American Megatrends, Inc.
BIOS Date: 08/10/05 14:51:11 Ver: 08.00.10
CPU: AMD Opteron(tm) Processor 254, Speed: 2.4 GHz, Count: 3
CPU Revision, CPU0: E4, CPU1: E6
Microcode Revision, CPU0: 0, CPU1: 0
DRAM Clocking, CPU0 = 400 MHz, CPU1 Core0/1 = 400 MHz
Sun Fire X4500 Server
1 AMD North Bridge, Rev E4
1 AMD North Bridge, Rev E6
1 AMD 8111 I/O Hub, Rev C2
2 AMD 8131 PCI-X Controllers, Rev B2
System Serial Number: O505AMF028
BMC Firmware Revision: 1.00
Checking NVRAM
Initializing USB Controllers: Done
Press F2 to run Setup (CTRL+E on Remote Keyboard)
Press F12 to boot from the network (CTRL+N on Remote Keyboard)
Press F8 for BBS POPUP (CTRL+P on Remote Keyboard)
- No SEL or DMI event is recorded.
- The system enters Halt mode and the following message is displayed:
***** Warning: Bad Mix of Processors *****
Multiple core processors cannot be installed with single core processors.
Fatal Error: System Halted
Hardware Error Handling Summary
TABLE B-1 summarizes the most common hardware errors that you might encounter with t
33. - FIGURE 14-3, Sun Fire X4540 Server Rear Panel, and TABLE 14-2, Rear Panel Features
Front Panel Features
FIGURE 14-1 shows the front panel. FIGURE 14-2 shows the controls and indicator details. TABLE 14-1 describes the controls and indicators.
FIGURE 14-1 Sun Fire X4540 Server Front Panel Features
Figure Legend
1 Locate button
2 Power OK LED
3 USB ports (2)
FIGURE 14-2 Sun Fire X4540 Server Front Panel Controls and Indicators
TABLE 14-1 Front Panel Controls and Indicators
No.  Name (Color): Description
1  Locate button/LED (White): Operators can turn this LED on remotely to help them locate the server in a crowded server room. Press to turn off. Pressing the Locate LED/Switch for five seconds turns all indicators on for 15 seconds.
2  System Fault (White): On when service action is required.
3  Power/Operation (Green): Steady: power is on. Blink: standby power is on, but main power is off. Off: power is off.
4  System power button (Grey): Powers on main power for all the server components.
5  Top failure LED (Amber): On for an HDD or fan fault.
6  Rear failure LED (Amber): On for a power supply or system controller fault (service is required).
7  Over Temperature LED (Amber): On when the system is over temperature.
34. press Enter or click the Start button when you are prompted to start the tests. The test suite runs until it encounters an error or the test is completed.
Note: The CD takes approximately nine minutes to boot.
9. When SunVTS software completes the test, review the log files generated during the test. SunVTS provides access to four different log files:
- SunVTS test error log contains time-stamped SunVTS test error messages. The log file path name is /var/opt/SUNWvts/logs/sunvts.err. This file is not created until a SunVTS test failure occurs.
- SunVTS kernel error log contains time-stamped SunVTS kernel and SunVTS probe errors. SunVTS kernel errors are errors that relate to running SunVTS, and not to testing of devices. The log file path name is /var/opt/SUNWvts/logs/vtsk.err. This file is not created until SunVTS reports a SunVTS kernel error.
- SunVTS information log contains informative messages that are generated when you start and stop the SunVTS test sessions. The log file path name is /var/opt/SUNWvts/logs/sunvts.info. This file is not created until a SunVTS test session runs.
- Solaris system message log is a log of all the general Solaris events logged by syslogd. The path name of this log file is /var/adm/messages.
a. Click the Log button. The Log file window is displayed.
b. Specify the log file that you want to view by selecting it from the Log File window.
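Because three of the four log files are created only after the corresponding event occurs, it can help to check which of them exist before opening them. The following is a minimal sketch, not part of SunVTS: the path separators are restored from the flattened path names above and should be treated as assumptions, and `existing_logs` is a hypothetical helper name.

```python
import os

# Path names as given in the list above (separators are an assumption).
SUNVTS_LOGS = {
    "SunVTS test error log": "/var/opt/SUNWvts/logs/sunvts.err",
    "SunVTS kernel error log": "/var/opt/SUNWvts/logs/vtsk.err",
    "SunVTS information log": "/var/opt/SUNWvts/logs/sunvts.info",
    "Solaris system message log": "/var/adm/messages",
}


def existing_logs(paths=SUNVTS_LOGS, exists=os.path.exists):
    """Return the subset of log files that are present on this system.

    The `exists` parameter lets the check be exercised without a real
    SunVTS installation.
    """
    return {name: path for name, path in paths.items() if exists(path)}


if __name__ == "__main__":
    found = existing_logs()
    for name, path in SUNVTS_LOGS.items():
        state = "present" if name in found else "not created yet"
        print(f"{name}: {path} ({state})")
```

A missing sunvts.err file is the expected result when no SunVTS test has failed.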
35. rpm, temperature, and voltage measurements.
- Click the Refresh button to update the sensor readings to their current status.
- Click the Show Thresholds button to display the settings that trigger alerts. The Sensor Readings table is updated. See the example in FIGURE 11-4. For example, if the system temperature reaches 30 degrees C, the service processor sends an alert. Sensor thresholds include the following:
  - Low/High NR: Low or high non-recoverable
  - Low/High CR: Low or high critical
  - Low/High NC: Low or high non-critical
FIGURE 11-4 Sensor Readings Page With Thresholds Displayed
[Screenshot: Integrated Lights Out Manager Sensor Readings page. It lists 77 sensors with status, name, reading, and threshold columns. For example, mb t_amb reads 24 degrees C (status Normal) with thresholds of 18, 20, 22, 36, and 40 degrees C, and mb v_bat reads 3.232 Volts with thresholds of 2.192, 2.496, 2.688, 3.392, and 3.6 Volts.]
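The three threshold levels can be read as nested bands around the normal range. The following is a minimal sketch of that interpretation, not the service processor's actual logic: the function name and threshold values are hypothetical, and it assumes an alert is raised when a reading reaches (equals or passes) a threshold, as the 30 degrees C example suggests.

```python
def classify_reading(value, low_nr=None, low_cr=None, low_nc=None,
                     high_nc=None, high_cr=None, high_nr=None):
    """Classify a sensor reading against optional NR/CR/NC thresholds.

    The most severe band is checked first, so a reading beyond the
    non-recoverable limit is never reported as merely critical.
    """
    if high_nr is not None and value >= high_nr:
        return "high non-recoverable"
    if low_nr is not None and value <= low_nr:
        return "low non-recoverable"
    if high_cr is not None and value >= high_cr:
        return "high critical"
    if low_cr is not None and value <= low_cr:
        return "low critical"
    if high_nc is not None and value >= high_nc:
        return "high non-critical"
    if low_nc is not None and value <= low_nc:
        return "low non-critical"
    return "normal"


if __name__ == "__main__":
    # Hypothetical temperature thresholds in degrees C.
    limits = dict(high_nc=30, high_cr=36, high_nr=40)
    for reading in (24, 30, 37, 41):
        print(reading, "->", classify_reading(reading, **limits))
```

Thresholds left as None are treated as not configured, which matches sensors that report "Not Available".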
36. sections:
- Uncorrectable Errors
- Correctable Errors
- Parity Errors (PERR)
- System Errors (SERR)
- Handling Mismatched Processors
- Hardware Error Handling Summary
Uncorrectable Errors
This section lists facts and considerations about how the server handles uncorrectable errors.
Note: The BIOS ChipKill feature must be disabled if you are testing for failures of multiple bits within a DRAM (ChipKill corrects for the failure of a four-bit wide DRAM).
- The BIOS logs the error to the SP system event log (SEL) through the baseboard management controller (BMC).
- The SP's SEL is updated with the failing DIMM pair's specific bank address.
- The system reboots.
- The BIOS logs the error in DMI and SP event logs.
Note: If the error is on low 1MB, the BIOS freezes after rebooting. Therefore, no DMI log is recorded.
- An example of the error reported by the SEL through IPMI 2.0 is as follows:
  - When low memory is erroneous, the BIOS is frozen on the pre-boot low memory test because the BIOS cannot decompress itself into faulty DRAM and execute the following items:
ipmitool> sel list
 100 | 08/26/2005 | 11:36:09 | OEM #0xfb |
 200 | 08/26/2005 | 11:36:12 | System Firmware Error | No usable system memory
 300 | 08/26/2005 | 11:36:12 | Memory | Memory Device Disabled | CPU 0 DIMM 0
  - When the faulty DIMM is beyond the BIO
37. the feature, it lists the storage devices with their logical device names, serial numbers, vendor, model, and drive temperatures.
- Identifies x64 platform type based on the x64 storage host controllers.
- Displays x64 Sun Fire X4500 server platform mapping type regardless of platform type, in bypass mode.
- No option: Probes the system in regular mode. This is the default mode for the utility. The utility maps all hard drives in the Solaris OS logical device name to physical slot numbers that are shown on the Sun Fire X4500 server chassis label. There are three status rows for each device:
  - Physical slot or location that matches the chassis label
  - Logical location that matches the Solaris OS storage device name (cXtY)
  - Drive runtime status. The following syntax is used:
    - Up arrow: Indicates the device.
    - Device is present and accessible.
    - Device is not accessible, absent, or empty.
TABLE 7-1 hd Options (Continued)
Option  Description
    - Devices under the controller are not enumerated. The controller is not enumerated until there are drives in the slots connected to the controller.
    - H: Devices received warning messages from the storage subsystem.
    - b: Drive slot is bootable if an OS is installed on the drive.
-d  Diagnoses the system by scanning the syslog (dmesg) for any disk's warning messages. If there is a disk-related warning message, the utility maps the physical location of the drive with the warning mes
38. the NMI is logged in the DMI and the SP SEL. See the following example command and output:
[root@d-mpk12-53-238 root]# ipmitool -H 129.146.53.95 -U root -P changeme -I lan sel list -v
SEL Record ID    : 0100
Record Type      : 00
Timestamp        : 01/10/2002 20:16:16
Generator ID     : 0001
EvM Revision     : 04
Sensor Type      : Critical Interrupt
Sensor Number    : 00
Event Type       : Sensor-specific Discrete
Event Direction  : Assertion Event
Event Data       : 04ff00
Description      : PCI PERR
- FIGURE D-4 shows an example of a DMI log screen from the BIOS Setup page with a parity error.
FIGURE D-4 DMI Log Screen, PCI Parity Error
[Screenshot: BIOS Setup Utility, View Event Log. Entry dated 09/12/05 14:27:47: PCI Parity.]
- The BIOS displays the following messages and freezes (during POST or DOS):
  - NMI EVENT
  - System Halted due to Fatal NMI
- The Linux NMI trap catches the interrupt and reports the following NMI "confusion report" sequence:
Aug 5 05:15:00 d-mpk12-53-159 kernel: Uhhuh. NMI received for unknown reason 2d on CPU 0.
Aug 5 05:15:00 d-mpk12-53-159 kernel: Uhhuh. NMI received for unknown reason 2d on CPU 1.
Aug 5 05:15:00 d-mpk12-53-159 kernel: Dazed and confused, but trying to continue
Aug 5 05:15:00 d-mpk12-53-159 kernel: Do you have a strange power saving mode enabled?
Aug 5 05:15:00 d-mpk12-53-159 kernel: Uhhuh. NMI received for unknown reason 3d on C
39. typematic rate.
Code  Description
75  Initializes Int 13 and prepares for IPL detection.
78  Initializes IPL devices controlled by BIOS and option ROMs.
7A  Initializes remaining option ROMs.
7C  Generates and writes contents of ESCD in NVRAM.
84  Logs errors encountered during POST.
85  Displays errors to the user and gets the user response for the error.
87  Executes BIOS setup if needed or requested.
8C  After all device initialization is done, programs any user-selectable parameters relating to NB/SB, such as timing parameters, non-cacheable regions, and the shadow RAM cacheability, and does any other NB/SB/PCIX/OEM-specific programming needed during Late POST. Background scrubbing for DRAM and for L1 and L2 caches is set up based on setup questions. Gets the DRAM scrub limits from each node.
8D  Builds ACPI tables (if ACPI is supported).
8E  Programs the peripheral parameters. Enables or disables NMI as selected.
90  Late POST initialization of system management interrupt.
A0  Checks boot password if installed.
A1  Clean-up work needed before booting to OS.
A2  Takes care of runtime image preparation for different BIOS modules. Fills the free area in F000h segment with 0FFh. Initializes the Microsoft IRQ Routing Table. Prepares the runtime language module. Disables the system configuration display if needed.
A4  Initializes runtime language module.
A7  Displays the system configuration screen if enabled. Initializes the CPUs before boot, which includes the programming of the MTRRs.
A8  Prepa
40. when CHIPKILL is disabled, or DIMM pair when CHIPKILL is enabled, and logs that information to the SP. The BIOS will halt the CPU.

Error: Unsupported DIMM configuration
Description: Unsupported DIMMs are used, or supported DIMMs are loaded improperly.
Handling: The BIOS displays an error message, logs an error, and halts the system.
Logged (DMI Log or SP SEL): DMI Log, SP SEL
Fatal?: Fatal

Error: HyperTransport link failure
Description: CRC or link error on one of the HyperTransport links.
Handling: Sync floods on HyperTransport links, the machine resets itself, and error information gets retained through reset. The BIOS reports: A Hyper Transport sync flood error occurred on last boot, press F1 to continue.
Logged (DMI Log or SP SEL): DMI Log, SP SEL
Fatal?: Fatal

TABLE D-1 Hardware Error Handling Summary (Continued)

Error: PCI SERR, PERR
Description: System or parity error on a PCI bus.
Handling: Sync floods on HyperTransport links, the machine resets itself, and error information gets retained through reset. The BIOS reports: A Hyper Transport sync flood error occurred on last boot, press F1 to continue.
Logged (DMI Log or SP SEL): DMI Log, SP SEL
Fatal?: Fatal

Error: BIOS POST Microcode Error
Description: The BIOS could not find or load the CPU Microcode Update to the CPU. The message most likely appears when a new CPU is installed in a system controller with an outdated BIOS. In this case, the BIOS must be updated.
Handling: The BIOS displays an error message, logs the error to DMI, and boots.
Logged (DMI Log or SP SEL): DMI Log
Fatal?: Non-fatal

Error: BIOS POST CMOS
Description: CMOS contents
Handling: The BIOS dis
41. you can try viewing the power-on self-test (POST) messages and BIOS event logs during system startup. Continue with Viewing Event Logs.
Troubleshooting DIMM Problems
Use this section to troubleshoot problems with memory modules, or DIMMs.
Note: For information on Sun's DIMM replacement policy for x64 servers, contact your Sun Service representative.
How DIMM Errors Are Handled By the System
This section describes system behavior for the two types of DIMM errors: uncorrectable errors (UCEs) and correctable errors (CEs). It also describes BIOS DIMM error messages.
Uncorrectable DIMM Errors
For all operating systems (OSs), the behavior is the same for UCEs:
1. When a UCE occurs, the memory controller causes an immediate reboot of the system.
2. During reboot, the BIOS checks the NorthBridge memory controller's Machine Check registers and determines that the previous reboot was due to a UCE, then reports this message in POST after the memtest stage:
A Hypertransport Sync Flood occurred on last boot
3. The BIOS reports this event in the service processor's system event log (SEL), as shown in the sample IPMItool output below:
ipmitool -H 10.6.77.249 -U root -P changeme -I lanplus sel list
 000 | 02/16/2006 | 03:32:38 | OEM #0x12 |
 100 | OEM record e0 | 00000000040 0c0200200000a2
 200 | OEM record e0 | 01000000040000000000000000
 300 | 02/16/2
42. 0 | Predictive Failure Asserted
In the sample output above, the sensor name is in the first column and the corresponding sensor number is in the second column. For a detailed explanation of each sensor listed by name, refer to the Integrated Lights Out Manager Supplement.
Viewing Component Information With IPMItool
You can view information about system hardware components. The software refers to these components as field-replaceable unit (FRU) devices. To read the FRU inventory information on these servers, you must first have the FRU ROMs programmed. After that is done, you can see a full list of the available FRU data by using the fru print command, as shown in the following example (only two FRU devices are shown in the example, but all devices would be shown):
ipmitool -I lanplus -H IPADDR -U root -P changeme fru print
FRU Device Description : Builtin FRU Device (ID 0)
 Board Mfg             : BENCHMARK ELECTRONICS
 Board Product         : ASSY SERV PROCESSOR X4X00
 Board Serial          : OOGOHSV 0523000195
 Board Part Number     : 501-6979-02
 Board Extra           : 000-000-00
 Board Extra           : HUNTSVILLE AL USA
 Board Extra           : b302
 Board Extra           : 06
 Board Extra           : GRASP
 Product Manufacturer  : SUN MICROSYSTEMS
 Product Name          : ILOM
FRU Device Description : sp net0 fru (ID 2)
 Product Manufacturer  : MOTOROLA
 Product Name          : FAST ETHERNET CONTROLLER
 Product Part Number   : MPC8248 FCC
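When all FRU devices are listed, the output can run to many records, so grouping the fields per device makes it easier to search. The following is a minimal sketch, not part of IPMItool: it assumes the "Name : Value" layout shown in the example above, and `parse_fru_print` is a hypothetical helper name.

```python
SAMPLE_FRU = """\
FRU Device Description : Builtin FRU Device (ID 0)
 Board Mfg             : BENCHMARK ELECTRONICS
 Board Extra           : b302
 Board Extra           : 06
 Product Name          : ILOM
FRU Device Description : sp net0 fru (ID 2)
 Product Manufacturer  : MOTOROLA
 Product Name          : FAST ETHERNET CONTROLLER
"""


def parse_fru_print(text):
    """Group `fru print` fields by FRU device.

    Each device becomes a dict with its description and a mapping of
    field name to a list of values, because fields such as
    'Board Extra' can repeat within one device.
    """
    devices = []
    current = None
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        if key == "FRU Device Description":
            current = {"description": value, "fields": {}}
            devices.append(current)
        elif current is not None:
            current["fields"].setdefault(key, []).append(value)
    return devices


if __name__ == "__main__":
    for device in parse_fru_print(SAMPLE_FRU):
        names = device["fields"].get("Product Name", ["(none)"])
        print(device["description"], "->", ", ".join(names))
```

Splitting on the first colon only keeps any colons inside a value intact.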
44. 0 HBA controller numbers and pci nodes.
Options/Parameters
Use the hd command to determine the status of a hard disk by mapping the drive location using the parameters shown in TABLE 7-1. The following options are supported for the functions shown.
TABLE 7-1 hd Options
Option  Description
-c  Displays status in color mode. There are three status rows for each device:
  - Physical slot location that matches the chassis label
  - Logical location that matches the Solaris OS storage device name (cXtY)
  - Drive runtime status. The following syntax is used:
    - Up arrow: Indicates the device.
    - Green: Device is enumerated. Device is present and accessible.
    - Red: Device is not enumerated, or there is no drive in the physical slot location. Device is not accessible, absent, empty, or down. Devices under the controller are not enumerated. The controller is not enumerated until there is a drive in the slots.
    - Yellow: Device has warning messages. Available in diagnose mode.
    - H: Device has warning messages from the storage subsystem.
    - Blue: Indicates a bootable drive slot.
    - b: Drive slot is bootable if an OS is installed on the drive.
- Provides a summary list of all the storage devices, device types, and a count of all storage devices. If the system is not a Sun Fire X4500 server and the subsystem supports
45. 006 | 03:32:50 | Memory | Uncorrectable ECC | CPU 1 DIMM 0
 400 | 02/16/2006 | 03:32:50 | Memory | Memory Device Disabled | CPU 1 DIMM 0
 500 | 02/16/2006 | 03:32:55 | System Firmware Progress | Motherboard initialization
 600 | 02/16/2006 | 03:32:55 | System Firmware Progress | Video initialization
 700 | 02/16/2006 | 03:33:01 | System Firmware Progress | USB resource configuration
Correctable DIMM Errors
At this time, CEs are not logged in the server's system event logs.
Note: When running Solaris 10, the Fault Management Architecture (FMA) manages memory CEs by providing fault monitoring and diagnosis.
BIOS DIMM Error Messages
The BIOS displays and logs three types of DIMM error messages:
- NODE n Memory Configuration Mismatch
  The following conditions cause this error message:
  - The DIMM mode is not paired (running in 64-bit mode instead of 128-bit mode).
  - The DIMM speeds are not the same.
  - The DIMMs do not support ECC.
  - The DIMMs are not registered.
  - The MCT stopped due to errors in the DIMM.
  - The DIMM module type (buffer) is mismatched.
  - The DIMM generation (I or II) is mismatched.
  - The DIMM CL/T is mismatched.
  - The banks on a two-sided DIMM are mismatched.
  - The DIMM organization is mismatched (128-bit).
  - The SPD is missing Trc or Trfc information.
- NODE n Paired DIMMs Mismatch
  The following condition displays this error message:
  - DIMMs pai
46. 0G V440 None Otheros
c7t0d0p0  K41BT4C7NVYS  HITACHI  HDS7225SBSUN250G  V440  None  Solaris2
c6t0d0p0  K41BT4CEE9NE  HITACHI  HDS7225SBSUN250G  V440  None  Solaris2
c0t0d0p0  K41BT4CE447E  HITACHI  HDS7225SBSUN250G  V440  None  Otheros
c7t4d0p0  K41BT4CE87AE  HITACHI  HDS7225SBSUN250G  V440  None  Otheros
c4t4d0p0  K41BT4C838MS  HITACHI  HDS7225SBSUN250G  V440  None  LinuxNative Solaris LinuxNative
c1t0d0p0  VNO3ZAGIWYWD  HITACHI  HDS7250SASUN500G  K2AO  None  IFS NTFS
c1t4d0p0  K41BT4C7N4HS  HITACHI  HDS7225SBSUN250G  V440  None  None
c5t1d0p0  VNO3ZAGAVSUD  HITACHI  HDS7250SASUN500G  K2AO  None  None
SunFireX4500 Rear
Slots 36-47:  c5t3 c5t7 c4t3 c4t7 c7t3 c7t7 c6t3 c6t7 c1t3 c1t7 c0t3 c0t7
Slots 24-35:  c5t2 c5t6 c4t2 c4t6 c7t2 c7t6 c6t2 c6t6 c1t2 c1t6 c0t2 c0t6
Slots 12-23:  c5t1 c5t5 c4t1 c4t5 c7t1 c7t5 c6t1 c6t5 c1t1 c1t5 c0t1 c0t5
Slots 0-11:   c5t0 c5t4 c4t0 c4t4 c7t0 c7t4 c6t0 c6t4 c1t0 c1t4 c0t0 c0t4
Status 0-11:  b    b    S    S    S    S    S    S    S    S    S    S
SunFireX4500 Front
Summary
Vendor   Model             Count
HITACHI  HDS7225SBSUN250G  12
HITACHI  HDS7250SASUN500G  2
Total Storage Devices      14
Partition Type                     Count
Solaris2                           6
None                               3
Otheros                            3
LinuxNative Solaris LinuxNative    1
IFS NTFS                           1
Total part
47. Location of Sensor entries for rows whose sensor names precede this excerpt: CPU board (6 rows), IO controller board (12 rows), Disk backplane (1 row).
Name of Sensor      Location of Sensor
1 speed             Fan board
ft1 f0 speed        Fan board
ft1 f1 speed        Fan board
ft2 f0 speed        Fan board
ft2 f1 speed        Fan board
ft3 f0 speed        Fan board
ft3 f1 speed        Fan board
ft4 f0 speed        Fan board
ft4 f1 speed        Fan board
ft0 prsnt           Fan board
TABLE C-1 (continued)
Name of Sensor      Location of Sensor
ft1 prsnt           Fan board
ft2 prsnt           Fan board
ft3 prsnt           Fan board
ft4 prsnt           Fan board
ps0 vinok           Power supply
ps0 pwrok           Power supply
ps1 vinok           Power supply
ps1 pwrok           Power supply
ps2 vinok           Power supply
ps2 pwrok           Power supply
ps0 prsnt           Power supply
ps1 prsnt           Power supply
ps2 prsnt           Power supply
hdd x state         Software sensors. No corresponding hardware. Status of the sensors are set by app running on host.
hdd x hba state     IO controller board
APPENDIX D Error Handling
This appendix contains information about how the servers process and log errors. It includes the following
48. 12-53-159 kernel: Uhhuh. NMI received for unknown reason 2d on CPU 1.
Aug 5 05:15:00 d-mpk12-53-159 kernel: Dazed and confused, but trying to continue
Aug 5 05:15:00 d-mpk12-53-159 kernel: Do you have a strange power saving mode enabled?
Aug 5 05:15:00 d-mpk12-53-159 kernel: Uhhuh. NMI received for unknown reason 3d on CPU 1.
Aug 5 05:15:00 d-mpk12-53-159 kernel: Dazed and confused, but trying to continue
Aug 5 05:15:00 d-mpk12-53-159 kernel: Do you have a strange power saving mode enabled?
Aug 5 05:15:00 d-mpk12-53-159 kernel: Uhhuh. NMI received for unknown reason 3d on CPU 0.
Aug 5 05:15:00 d-mpk12-53-159 kernel: Dazed and confused, but trying to continue
Aug 5 05:15:00 d-mpk12-53-159 kernel: Do you have a strange power saving mode enabled?
Aug 5 05:15:00 d-mpk12-53-159 kernel: Dazed and confused, but trying to continue
Aug 5 05:15:00 d-mpk12-53-159 kernel: Do you have a strange power saving mode enabled?
Handling of System Errors (SERR)
This section lists facts and considerations about how the server handles system errors (SERR).
- System error handling works through the HyperTransport Synch Flood Error mechanism on 8111 and 8131.
- The following events happen during BIOS POST:
  - POST reports any previous system errors at the bottom of the screen. See FIGURE B-5 for an example.
FIGURE B-5 POST Scree
49. Interpreting Event Log Time Stamps
The system event log time stamps are related to the service processor clock settings. If the clock settings change, the change is reflected in the time stamps. When the service processor reboots, the SP clock is set to Thu Jan 1 00:00:00 UTC 1970. The SP reboots as a result of the following:
- A complete system unplug/replug power cycle
- An IPMI command, for example, mc reset cold
- A command-line interface (CLI) command, for example, reset /SP
- An ILOM web GUI operation, for example, from the Maintenance tab, selecting Reset SP
- An SP firmware upgrade
After an SP reboot, the SP clock is changed by the following:
- When the host is booted. The host's BIOS unconditionally sets the SP time to that indicated by the host's RTC. The host's RTC is set by the following operations:
  - When the host's CMOS is cleared as a result of changing the host's RTC battery or inserting the CMOS-clear jumper on the system controller. The host's RTC starts at Jan 1 00:01:00 2002.
  - When the host's operating system sets the host's RTC. The BIOS does not consider time zones. Solaris and Linux software respect time zones and set the system clock to UTC. Therefore, after the OS adjusts the RTC, the time set by the BIOS will be UTC.
  - When the user sets the RTC using the host BIOS Setup screen.
- Continuously via NTP, if NTP is enabled on th
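Because the SP clock restarts from the 1970 epoch after an SP reboot, event log entries stamped shortly after that epoch were most likely recorded before the host or NTP set the clock. The following is a minimal sketch of such a check, not an ILOM feature: the function names, the one-year window, and the MM/DD/YYYY HH:MM:SS stamp layout are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone

# The SP clock value immediately after an SP reboot, per the text above.
SP_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def parse_sel_timestamp(date_text, time_text):
    """Parse SEL date and time fields, treating the stamp as UTC."""
    stamp = datetime.strptime(f"{date_text} {time_text}", "%m/%d/%Y %H:%M:%S")
    return stamp.replace(tzinfo=timezone.utc)


def sp_clock_looks_unset(stamp, window=timedelta(days=365)):
    """Heuristic: True if the stamp falls within `window` of the epoch."""
    return SP_EPOCH <= stamp < SP_EPOCH + window


if __name__ == "__main__":
    for date_text, time_text in (("01/01/1970", "00:05:12"),
                                 ("09/25/2007", "03:22:03")):
        stamp = parse_sel_timestamp(date_text, time_text)
        note = "SP clock probably not set yet" if sp_clock_looks_unset(stamp) else "ok"
        print(date_text, time_text, "->", note)
```

Entries flagged this way can only be ordered relative to each other, not placed in wall-clock time.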
50. 2.led | 00h | ns | 29.2 | Generic Device @20h:19h.2
ft1.fm0.fail | 40h | ok | 29.3 | Predictive Failure Deasserted
ft1.fm0.led | 00h | ns | 29.3 | Generic Device @20h:19h.3
ft1.fm1.fail | 41h | ok | 29.4 | Predictive Failure Deasserted
ft1.fm1.led | 00h | ns | 29.4 | Generic Device @20h:19h.4
ft1.fm2.fail | 42h | ok | 29.5 | Predictive Failure Deasserted
ft1.fm2.led | 00h | ns | 29.5 | Generic Device @20h:19h.5
ft0.fm0.f0.speed | 43h | ok | 29.0 | 6000 RPM
ft0.fm1.f0.speed | 44h | ok | 29.1 | 6000 RPM
ft0.fm2.f0.speed | 45h | ok | 29.2 | 6000 RPM
ft1.fm0.f0.speed | 46h | ok | 29.3 | 6000 RPM
ft1.fm1.f0.speed | 47h | ok | 29.4 | 6000 RPM
ft1.fm2.f0.speed | 48h | ok | 29.5 | 6000 RPM

Other queries can include a particular type of sensor. The command in the following example would return a list of all Temperature type sensors in the SDR.
ipmitool -I lanplus -H IPADDR -U root -P changeme sdr type temperature
sys.tempfail | 03h | ok | 23.0 | Predictive Failure Deasserted
mb.t_amb | 05h | ok | 7.0 | 25 degrees C
fp.t_amb | 14h | ok | 12.0 | 25 degrees C
ps.t_amb | 1Bh | ok | 10.0 | 24 degrees C
io.t_amb | 22h | ok | 15.0 | 23 degrees C
p0.t_core | 2Ch | ok | 3.0 | 35 degrees C
p1.t_core | 35h | ok | 3.1 | 36 degrees C

Using IPMItool to View the ILOM SP System Event Log
The ILOM SP System Event Log (SEL) provides storage of all system events. You can view the SEL with IPMItool. This topic includes the following sections:
51. 3 c7t7 c6t3 c6t7 c1t3 c1t7 c0t3 c0t7
24 25 26 27 28 29 30 31 32 33 34 35
c5t2 c5t6 c4t2 c4t6 c7t2 c7t6 c6t2 c6t6 c1t2 c1t6 c0t2 c0t6
12 13 14 15 16 17 18 19 20 21 22 23
c5t1 c5t5 c4t1 c4t5 c7t1 c7t5 c6t1 c6t5 c1t1 c1t5 c0t1 c0t5
0 1 2 3 4 5 6 7 8 9 10 11
c5t0 c5t4 c4t0 c4t4 c7t0 c7t4 c6t0 c6t4 c1t0 c1t4 c0t0 c0t4
c2 c3 s3 c5 c4 c7 c6 c1 c0

Pre-ILOM 2.0.2.5 and One USB Device
TABLE 7-5 Sun Fire X4500 Disk Mapping: One USB Storage Device, Pre-ILOM 2.0.2.5
Columns: USB CD-ROM | USB Floppy | USB Device | Controller 3 | Controller 2 | Controller 5 | Controller 4 | Controller 1 | Controller 0
(Only the right-hand columns of each row survive in this excerpt.)
44 45 46 47: c1t3 c1t7 c0t3 c0t7
32 33 34 35: c4ty c1t2 c1t6 c0t2 c0t6
20 21 22 23: c1t1 c1t5 c0t1 c0t5
8 9 10 11: c1t0 c1t4 c0t0 c0t4
c2 c3 c4 c1 c0
Asterisk denotes a USB device that is not installed on the configuration shown in TABLE 7-4.

ILOM 2.0.2.5 or Later and No USB Device
When no USB devices are present, there is a direct 1:1 physical controller number to /dev/cXtY mapping. For example, Controller 2 has a c2 name, Controller 3 has a c3 name, Controller 4 has a c4 name, and so on.
TABLE 7-6 Sun Fire X4500 Disk Mapping: No USB Storage Device, ILOM 2.0.2.5
Columns: USB CD-ROM | USB Floppy | USB Device | Controller 3 | Controller 2 | Controller 5 | Controller 4 | Controller 1 | Controller 0
45 46 47: c1t7 c0t3 c0t7
33 34 35: c1t6 c0t2 c
52. 4 Volts
p1.v_+1v25core | 37h | ok | 3.1 | 1.32 Volts
ft0.fm0.f0.speed | 43h | ok | 29.0 | 6000 RPM
ft0.fm1.f0.speed | 44h | ok | 29.1 | 6000 RPM
ft0.fm2.f0.speed | 45h | ok | 29.2 | 6000 RPM
ft1.fm0.f0.speed | 46h | ok | 29.3 | 6000 RPM
ft1.fm1.f0.speed | 47h | ok | 29.4 | 6000 RPM
ft1.fm2.f0.speed | 48h | ok | 29.5 | 6000 RPM

You can also generate a list of all sensors for a specific Entity. Use the list output to determine which entity you are interested in seeing, then use the sdr entity command to get a list of all sensors for that entity. This command accepts an entity ID and an optional entity instance argument. If an entity instance is not specified, it will display all instances of that entity.
The entity ID is given in the fourth field of the output, as read from left to right. For example, in the output shown in the previous example, all the fans are entity 29. The last fan listed (29.5) is entity 29, with instance 5:
ft1.fm2.f0.speed | 48h | ok | 29.5 | 6000 RPM
For example, to see all fan-related sensors, you would use the following command that uses the entity 29 argument:
ipmitool -I lanplus -H IPADDR -U root -P changeme sdr entity 29
ft0.fm0.fail | 3Dh | ok | 29.0 | Predictive Failure Deasserted
ft0.fm0.led | 00h | ns | 29.0 | Generic Device @20h:19h.0
ft0.fm1.fail | 3Eh | ok | 29.1 | Predictive Failure Deasserted
ft0.fm1.led | 00h | ns | 29.1 | Generic Device @20h:19h.1
ft0.fm2.fail | 3Fh | ok | 29.2 | Predictive Failure Deasserted
ft0.fm
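The fourth-field rule described above (entity ID, then a dot, then the entity instance) can be applied mechanically to saved sensor output. The following Python sketch assumes the pipe-delimited, five-field layout that IPMItool normally prints for SDR listings; the SdrRow type, the function names, and the sample lines are illustrative, not part of IPMItool.

```python
from typing import Iterable, List, NamedTuple, Optional


class SdrRow(NamedTuple):
    name: str
    sensor_id: str
    status: str
    entity_id: int
    entity_instance: int
    reading: str


def parse_sdr_line(line: str) -> Optional[SdrRow]:
    """Parse one 'name | id | status | entity.instance | reading' line."""
    fields = [f.strip() for f in line.split("|")]
    if len(fields) != 5:
        return None  # not a sensor row (blank line, prose, truncated output)
    entity, dot, instance = fields[3].partition(".")
    if not dot or not entity.isdigit() or not instance.isdigit():
        return None
    return SdrRow(fields[0], fields[1], fields[2],
                  int(entity), int(instance), fields[4])


def sensors_for_entity(lines: Iterable[str], entity_id: int,
                       instance: Optional[int] = None) -> List[SdrRow]:
    """Select rows for one entity; all instances when instance is None."""
    rows = (parse_sdr_line(line) for line in lines)
    return [r for r in rows
            if r is not None and r.entity_id == entity_id
            and (instance is None or r.entity_instance == instance)]


# Sample lines modeled on the listings in this guide.
SAMPLE = [
    "mb.t_amb | 05h | ok | 7.0 | 25 degrees C",
    "ft0.fm0.f0.speed | 43h | ok | 29.0 | 6000 RPM",
    "ft1.fm2.f0.speed | 48h | ok | 29.5 | 6000 RPM",
]
```

Calling sensors_for_entity(SAMPLE, 29) mirrors what the sdr entity 29 command selects on the server, and passing instance=5 narrows the result to the last fan.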
53. [Sensor Readings page residue: Refresh and Hide Thresholds buttons]
6. Click the Hide Thresholds button to revert to the sensor readings.
The sensor readings are redisplayed, without the thresholds.
7. If the problem with the server is not evident after viewing sensor readings information, continue with "Running SunVTS Diagnostic Tests" on page 120.

CHAPTER 12
Using IPMItool to View System Information
This chapter contains information about using the Intelligent Platform Management Interface (IPMI) to view monitoring and maintenance information for your server. This chapter includes the following sections:
- "About IPMI" on page 143
- "About IPMItool" on page 144
- "Connecting to the Server With IPMItool" on page 144
- "Using IPMItool to Read Sensors" on page 146
- "Using IPMItool to View the ILOM SP System Event Log" on page 149
- "Viewing Component Information With IPMItool" on page 152
- "Viewing and Setting Status LEDs" on page 153

About IPMI
IPMI is an open-standard hardware management interface specification that defines a specific way for embedded management subsystems to communicate. IPMI information is exchanged through baseboard management controllers (BMCs), which are located on IPMI-compliant hardware components. Using low-level hardware intelligence instead of the operating system has two main ben
54. 9 Each LED has both a descriptor and a status reading sensor, and the two are linked; that is, if you use the led sensor to turn on a particular LED, then the status change is represented in the associated fail sensor. Also, for some of these, an event is generated in the SEL. For LEDs that blink on failure (instead of steady-on), the events are not generated, because an event would otherwise be logged every time the LED flashed in the blink cycle.
TABLE 4-3 lists the LED sensor IDs in these servers. See "Identifying Status and Fault LEDs" on page 171 for diagrams of the LED locations.

TABLE 4-3 LED Sensor IDs
LED Sensor ID       Description
sys.power.led       System Power (front and back)
sys.locate.led      System Locate (front and back)
sys.alert.led       System Alert (front and back)
sys.psfail.led      System Power Supply Failed
sys.tempfail.led    System Over Temperature
sys.fanfail.led     System Fan Failed
bp.power.led        Back Panel Power
bp.locate.led       Back Panel Locate
bp.alert.led        Back Panel Alert
fp.power.led        Front Panel Power
fp.locate.led       Front Panel Locate
fp.alert.led        Front Panel Alert
io.hdd0.led         Hard Disk 0 Failed
io.hdd1.led         Hard Disk 1 Failed
io.hdd2.led         Hard Disk 2 Failed
io.hdd3.led         Hard Disk 3 Failed
io.f0.led           I/O Fan Failed
p0.led              CPU 0 Failed
p0.d0.led           CPU 0 DIMM 0 Failed
p0.d1.led           CPU 0 DIMM 1 Failed
p0.d2.led           CPU 0 DIMM 2 Failed
p0.d3.led           CPU 0 DIMM 3 Failed
p1.led              CPU 1 Failed
55. AMIBIOS (C) 2006 American Megatrends, Inc.
BIOS Build Version: 0ABNF010  Date: 04/04/08 18:56:20  Core: 08.00.14
CPU: Quad-Core AMD Opteron(tm) Processor 2356, Speed: 2.30 GHz, Count: 8
Node0: DCT0 667 MHz, DCT1 667 MHz
Node1: DCT0 667 MHz, DCT1 667 MHz
Sun Fire X4540, 2 AMD North Bridges, Rev B3
NVMM ROM Version: 4.081.40
BMC Firmware Revision: 2.0.2.3
CPLD Revision: 2.0
SP IP Address: 010.006.143.054
Initializing USB Controllers .. Done
Press F2 to run Setup (CTRL+E on Remote Keyboard)
Press F8 for BBS POPUP (CTRL+P on Remote Keyboard)
Press F12 to boot from the network (CTRL+N on Remote Keyboard)
System Memory: 64.0 GB
USB Device(s): 2 Keyboards, 2 Mice, 1 Hub
Auto-detecting USB Mass Storage Devices ..
00 USB mass storage devices found and configured.
0085 BMC Responding
Press <ESC> to continue

- No SEL or DMI event is recorded.
- The system enters Halt mode and the following message is displayed:
******** Warning: Bad Mix of Processors ********
Multiple core processors cannot be installed with single core processors.
Fatal Error ... System Halted.

Hardware Error Handling Summary
TABLE D-1 summarizes the most common hardware errors that you might encounter with these servers.
TABLE D-1 Hardware Error Handling Summary
Error | Description | Handling | Logged (DMI Log or SP SEL) | Fatal?
SP failure | The SP fails t
56. CPU frequency has been calculated to prevent bad programming.
0A  Initializes the 8042-compatible Keyboard Controller.
0B  Detects the presence of PS/2 mouse.
0C  Detects the presence of Keyboard in KBC port.
0E  Testing and initialization of different Input Devices. Also, update the Kernel Variables. Traps the INT09h vector, so that the POST INT09h handler gets control for IRQ1. Uncompress all available language, BIOS logo, and Silent logo modules.
13  Initializes PM regs and PM PCI regs at Early POST. Initializes multi-host bridge, if the system supports it. Setup ECC options before memory clearing. REDIRECTION causes corrected data to be written to RAM immediately. CHIPKILL provides 4-bit error detection and correction of x4 type memory. Enable PCI-X clock lines in the 8131.
20  Relocate all the CPUs to a unique SMBASE address. The BSP will be set to have its entry point at A000:0. If fewer than 5 CPU sockets are present on a board, subsequent CPUs' entry points will be separated by 8000h bytes. If more than 4 CPU sockets are present, entry points are separated by 200h bytes. CPU module will be responsible for the relocation of the CPU to the correct address. NOTE: APs are left in the INIT state.
24  Uncompress and initialize any platform-specific BIOS modules.
30  Initializes System Management Interrupt.
2A  Initializes different devices through DIM.
2C  Initializes different devices. Detects and initializes the video adapter installed in the system that have optional ROMs.
2E  In
57. Hide Thresholds
6. Click the Hide Thresholds button to revert to the sensor readings.
The sensor readings are redisplayed, without the thresholds.
7. If the problem with the server is not evident after viewing sensor readings information, continue with "Running SunVTS Diagnostic Tests" on page 120.

CHAPTER 4
Using IPMItool to View System Information
This chapter contains information about using the Intelligent Platform Management Interface (IPMI) to view monitoring and maintenance information for your server. It includes the following sections:
- "About IPMI" on page 32
- "About IPMItool" on page 32
- "Connecting to the Server With IPMItool" on page 33
- "Using IPMItool to Read Sensors" on page 35
- "Using IPMItool to View the ILOM SP System Event Log" on page 38
- "Viewing Component Information With IPMItool" on page 41
- "Viewing and Setting Status LEDs" on page 42

About IPMI
IPMI is an open-standard hardware management interface specification that defines a specific way for embedded management subsystems to communicate. IPMI information is exchanged through baseboard management controllers (BMCs), which are located on IPMI-compliant hardware components. Using low-level hardware intelligence instead of the operating system has two main benefits: first, this
58. Identifying Status and Fault LEDs 171
Front Panel Features 172
Rear Panel Features 174
Internal Status Indicator LEDs 175
Disk Drive and Fan Tray LEDs 176
CPU Board LEDs 178
Sun Fire X4540 Sensor Locations 181
Error Handling 185
Uncorrectable Errors 185
Correctable Errors 188
Parity Errors (PERR) 190
System Errors (SERR) 192
Handling Mismatched Processors 195
Hardware Error Handling Summary 196
Index 201

Preface
The Sun Fire X4500/X4540 Server Diagnostics Guide contains information and procedures to troubleshoot and diagnose problems with Sun Fire X4500/X4540 Servers.

Before You Read This Document
It is important that you review the safety guidelines in the Sun Fire X4500 Server Safety and Compliance Guide (819-4776).

Related Documentation
For a description of the document set for the Sun Fire X4500/X4540 servers, see the Where To Find Documentation sheet that is packed with your system and also posted at the product's documentation site. See the following URLs:
http://docs.sun.com/app/docs/prod/sf.x4500#hic
http://docs.sun.com/app/docs/prod/sf.x4540#hic
Translated versions of some of these documents are available at the web site described above in French, Simplified Chinese, and Japanese. English documentation is revised more frequently and might be more up-to-date than the translated documentation.
For Sun hardware documentation, Solari
59. Itool. This topic includes the following sections:
- "Viewing the SEL With IPMItool" on page 38
- "Clearing the SEL With IPMItool" on page 40
- "Using the Sensor Data Repository (SDR) Cache" on page 40
- "Sensor Numbers and Sensor Names in SEL Events" on page 40

Viewing the SEL With IPMItool
Two separate IPMI commands allow you to see different levels of detail in the ILOM SP SEL.
- To view the ILOM SP SEL with a minimal level of detail, type the sel list command:
ipmitool -I lanplus -H IPADDR -U root -P changeme sel list
100 | Pre-Init Time-stamp | Entity Presence #0x16 | Device Absent
200 | Pre-Init Time-stamp | Entity Presence #0x26 | Device Present
300 | Pre-Init Time-stamp | Entity Presence #0x25 | Device Absent
400 | Pre-Init Time-stamp | Phys Security #0x01 | Gen Chassis intrusion
500 | Pre-Init Time-stamp | Entity Presence #0x12 | Device Present
Note: When you use this command, an event record shows a sensor number, but does not display the name of the sensor for the event. For example, in line 100 in the sample output above, the sensor number 0x16 is displayed. For information about how to map sensor names to the different sensor number formats that might be displayed, see "Sensor Numbers and Sensor Names in SEL Events" on page 40.
- To view the ILOM SP SEL with a detailed event output, type the sel elist command instead o
60. [BIOS Setup screen: Event Logging details. Menu items: View Event Log, Mark all events as read, Clear Event Log. Key legend: Select Screen, Select Item, Enter Go to Sub Screen, F1 General Help, F10 Save and Exit, ESC Exit.]
c. From the Event Logging Details screen, select View Event Log.
All unread events are displayed.
4. View the BMC system event log:
a. From the BIOS Main Menu screen, select Advanced.
The Advanced Settings screen is displayed. See below.
b. From the Advanced Settings screen, select IPMI 2.0 Configuration.
The Advanced Menu IPMI 2.0 Configuration screen is displayed.
BIOS Setup screen: Advanced, IPMI 2.0 Configuration
Help text: View all events in the BMC Event Log. It will take up to 60 seconds (approx.) to read all BMC SEL records.
Status Of BMC: Working
View BMC System Event Log
Reload BMC System Event Log
Clear BMC System Event Log
LAN Configuration
PEF Configuration
BMC Watch Dog Timer Actio
61. TABLE 4-3 LED Sensor IDs (Continued)
LED Sensor ID    Description
p1.d0.led        CPU 1 DIMM 0 Failed
p1.d1.led        CPU 1 DIMM 1 Failed
p1.d2.led        CPU 1 DIMM 2 Failed
p1.d3.led        CPU 1 DIMM 3 Failed
ft0.fm0.led      Fan Tray 0 Module 0 Failed
ft0.fm1.led      Fan Tray 0 Module 1 Failed
ft0.fm2.led      Fan Tray 0 Module 2 Failed
ft1.fm0.led      Fan Tray 1 Module 0 Failed
ft1.fm1.led      Fan Tray 1 Module 1 Failed
ft1.fm2.led      Fan Tray 1 Module 2 Failed

LED Modes
You supply the modes in TABLE 4-4 to the led set commands to specify the mode in which you want the LED to be placed.
TABLE 4-4 LED Modes
Mode       Description
OFF        LED off
ON         LED steady-on
STANDBY    100 ms on, 2900 ms off
SLOW       1 Hz blink rate
FAST       4 Hz blink rate

LED Sensor Groups
Because each LED has its own sensor and can be controlled independently, there is some overlap in sensors. In particular, there are separate LEDs defined for the power, locate, and alert LEDs on the front and back panels.
It is desirable to have these sensors linked, so that both the front and back panel LEDs can be controlled at the same time. This is handled through the use of Entity Association Records. These are records in the SDR that contain a list of entities that are considered part of a group. For each Entity Association Record, we also define another Generic Device Locator as a logical entity to
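The LED modes in TABLE 4-4 mix two notations: STANDBY is given as on and off times, while SLOW and FAST are given as blink rates. The following Python sketch puts them on a common footing by computing the length of one blink cycle. The dictionary layout and function name are illustrative assumptions; the table does not state the on/off split for SLOW and FAST, so only the cycle length is derived for those.

```python
from typing import Optional

# Timing data transcribed from TABLE 4-4. None means the LED does not blink.
LED_MODES = {
    "OFF": None,
    "ON": None,
    "STANDBY": {"on_ms": 100, "off_ms": 2900},
    "SLOW": {"rate_hz": 1},
    "FAST": {"rate_hz": 4},
}


def blink_period_ms(mode: str) -> Optional[float]:
    """Return the length of one blink cycle in milliseconds, or None."""
    timing = LED_MODES[mode.upper()]
    if timing is None:
        return None
    if "rate_hz" in timing:
        return 1000 / timing["rate_hz"]
    return float(timing["on_ms"] + timing["off_ms"])
```

With these figures a STANDBY cycle lasts 3000 ms, so the LED is lit for 100 ms out of every three seconds, whereas FAST completes a cycle every 250 ms.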
62. [Replaceable component information screen. Chassis Information fields: Type, Part Number, Serial Number. Board Information fields: Manufacturer, Product Name, Serial Number, Part Number. Product Information fields: Manufacturer Name, Product Name, Serial Number, Part Number.]
3. Select a component from the drop-down list.
Information about the selected component is displayed.
4. If the problem with the server is not evident after viewing replaceable component information, continue with "Running SunVTS Diagnostic Tests" on page 120.

Viewing Temperature, Voltage, and Fan Sensor Readings
This section describes how to view the Sun Fire X4500 server temperature, voltage, and fan sensor readings.
There are a total of six temperature sensors that are monitored. They all generate IPMI events that are logged in the system event log (SEL) when an upper threshold is exceeded. Three of these sensor readings are used to adjust the fan speeds and perform other actions, such as illuminating LEDs and powering off the chassis. These sensors and their respective thresholds are as follows:
- Front panel ambient temperature (fp.t_amb)
  Upper non-critical: 30 degrees C
  Upper critical: 35 degrees C
  Upper non-recoverab
63. OS pre-boot process. TABLE 13-2 describes the type of checkpoints that might occur during the POST portion of the BIOS. These two-digit checkpoints are the output from primary I/O port 80.

TABLE 13-2 POST Code Checkpoints
Post Code  Description
03  Disable NMI, Parity, video for EGA, and DMA controllers. At this point, only ROM accesses are to the GPNV. If BB size is 64K, it is required to turn on ROM Decode below FFFF0000h. It should allow USB to run in E000 segment. The HT must program the NB-specific initialization, and OEM-specific initialization can be programmed, if needed, at the beginning of BIOS POST, such as overriding the default values of Kernel Variables.
04  Check CMOS diagnostic byte to determine if battery power is OK and CMOS checksum is OK. Verify CMOS checksum manually by reading storage area. If the CMOS checksum is bad, update CMOS with power-on default values and clear passwords. Initialize status register A. Initializes data variables that are based on CMOS setup questions. Initializes both the 8259-compatible PICs in the system.
05  Initializes the interrupt controlling hardware (generally PIC) and interrupt vector table.
06  Do R/W test to CH-2 count reg. Initialize CH-0 as system timer. Install the POSTINT1Ch handler. Enable IRQ-0 in PIC for system timer interrupt. Traps INT1Ch vector to POSTINT1ChHandlerBlock.
C0  Early CPU Init Start. Disable Cache. Init Local APIC.
C1  Set up boot strap proces
64. 0t6
21 22 23: c1t5 c0t1 c0t5
9 10 11: c3t0 c3t4 c2t0 c2t4 c5t0 c5t4 c4t0 c4t4 c1t0 c1t4 c0t0 c0t4

ILOM 2.0.2.5 or Later and One USB Device
When the ILOM version is 2.0.2.5 or later, the remote floppy and remote CD-ROM are treated as USB storage devices and are only mapped when they are enabled in the JavaRConsole. The channel numbers change depending on how many total USB storage devices are present at the time of OS installation. All USB devices are enumerated between physical Controller 1 and Controller 2, which causes a shift in naming when compared to systems without any USB storage devices. For example, a USB DVD gets the c2 name and Controller 2 gets the c3 name.
TABLE 7-7 Sun Fire X4500 Disk Mapping: One USB Storage Device, ILOM 2.0.2.5
Columns: USB Device | USB Device | USB Device | Controller 3 | Controller 2 | Controller 5 | Controller 4 | Controller 1 | Controller 0
(Only the left-hand columns of each row survive in this excerpt.)
36 37: c2ty c4t3 c4t7
24 25: c2ty c4t2 c4t6
12 13: c2ty c4t1 c4t5
0 1: c2ty c4t0 c4t4
c2 c4
Asterisk denotes the device name (c2) mapped to the only USB device on the system.

ILOM 2.0.2.5 or Later and Three USB Storage Devices
The following disk mapping applies to a system with three USB storage
65. Out Manager Login screen is displayed.
b. Type your user name and password.
When you first try to access the ILOM SP, you are prompted to type the default user name and password. The default user name and password are:
Default user name: root
Default password: changeme
2. From the System Monitoring tab, select Event Logs.
The System Event Logs page is displayed. See FIGURE 11-1 for sample information.

FIGURE 11-1 System Event Logs Page
[Screen: Sensor Readings, Event Logs, and Locator Indicator tabs. System Event Logs: View sensor-specific, BIOS-generated, or system management software event logs. Select an event log category: Sensor-Specific Events. Event Log: 4 event entries.]
Event ID | Time Stamp | Sensor Name | Sensor Type | Description
4 | 12/31/1969 16:01:01 | ps1.vinok | Power Supply | State Asserted: Asserted
3 | 12/31/1969 16:01:01 | ps0.prsnt | Entity Presence | Device Removed/Device Absent: Asserted
2 | 12/31/1969 16:00:57 | ps1.prsnt | Entity Presence | Device Inserted/Device Present: Asserted
1 | 12/31/1969 16:00:56 | ps1.pwrok | Power Supply | State Deasserted: Asserted

3. Select a category of an event that you want to view in the log from the drop-down menu.
You can select from the following types of events:
- Sensor-specific events. These events relate to a specific sensor for a component, for example, a fan sensor or a power sup
66. PS HITACHI HDS7225SBSUN250G V440 None
c0t4d0s2 K41BT4C7N4HS HITACHI HDS7225SBSUN250G V440 None
c1t0d0s2 K41BT4C7MTSS HITACHI HDS7225SBSUN250G V440 None
c1t4d0s2 K41BT4C7NXHS HITACHI HDS7225SBSUN250G V440 None
c2t0d0s2 AMI Virtual CDROM 1.00 None
c3t0d0s2 AMI Virtual Floppy 1.00 None
c4t0d0s2 TEAC DV W516GA C482 None
c5t0d0s2 K41BT4C7NVYS HITACHI HDS7225SBSUN250G V440 None
c5t4d0s2 K41BT4C7MP2S HITACHI HDS7225SBSUN250G V440 None
c6t0d0s2 K41BT4C7P2BS HITACHI HDS7225SBSUN250G V440 None
c6t4d0s2 K41BT4C7NG1S HITACHI HDS7225SBSUN250G V440 None
c7t0d0s2 K41BT4C7N54S HITACHI HDS7225SBSUN250G V440 None
c7t4d0s2 K41BT4C7NVES HITACHI HDS7225SBSUN250G V440 None
c8t0d0s2 K41BT4C7MKRS HITACHI HDS7225SBSUN250G V440 None
c8t4d0s2 K41BT4C7N49S HITACHI HDS7225SBSUN250G V440 None

FIGURE 7-4 (Continued) hd Utility Summary
Sun Fire X4500 Server Rear
36 37 38 39 40 41 42 43 44 45 46 47
c6t3 c6t7 c5t3 c5t7 c8t3 c8t7 c7t3 c7t7 c1t3 c1t7 c0t3 c0t7
24 25 26 27 28 29 30 31 32 33 34 35
c6t2 c6t6 c5t2 c5t6 c8t2 c8t6 c7t2 c7t6 c1t2 c1t6 c0t2 c0t6
12 13 14 15 16 17 18 19 20 21 22 23
c6t1 c6t5 c5t1 c5t5 c8t1 c8t5 c7t1 c7t5 c1t1 c1t5 c0t1 c0t5
0 1 2 3 4 5 6 7 8 9 10 11
c6t0 c6t4 c5t0 c5t4 c8t0 c8t4 c7t0 c7t4 c1t0 c1t4 c0t0 c0t4
b b Att Att Na Na Na Na Na Att Att Att
67. PU 1
Aug 5 05:15:00 d-mpk12-53-159 kernel: Dazed and confused, but trying to continue
Aug 5 05:15:00 d-mpk12-53-159 kernel: Do you have a strange power saving mode enabled?
Aug 5 05:15:00 d-mpk12-53-159 kernel: Uhhuh. NMI received for unknown reason 3d on CPU 0.
Aug 5 05:15:00 d-mpk12-53-159 kernel: Dazed and confused, but trying to continue
Aug 5 05:15:00 d-mpk12-53-159 kernel: Do you have a strange power saving mode enabled?
Aug 5 05:15:00 d-mpk12-53-159 kernel: Dazed and confused, but trying to continue
Aug 5 05:15:00 d-mpk12-53-159 kernel: Do you have a strange power saving mode enabled?

System Errors (SERR)
This section lists facts and considerations about how the server handles system errors (SERR).
- System error handling works through the HyperTransport Synch Flood Error mechanism on 8111 and 8131.
- The following events happen during BIOS POST:
  - POST reports any previous system errors at the bottom of the screen. See FIGURE D-5 for an example.

FIGURE D-5 POST Screen, Previous System Error Listed
[POST screen banner: American Megatrends (www.ami.com), Sun Microsystems]
BMC Firmware Revision: 1.00
Checking NVRAM..
Initializing USB Controllers .. Done
Press F2 to run Setup (CTRL+E on Remote Keyboard)
Press F12 to boot from the network (CTRL+N on Remote Keyboard)
USB Device(s): 3 Keyboards, 3 Mice, 2 Storag
68. Product Serial : 00:03:BA:D8:73:AC
Product Extra : 01
Product Extra : 00:03:BA:D8:73:AC

Viewing and Setting Status LEDs
In these servers, all LEDs are activity oriented; that is, the SP is responsible for the I2C commands that assert and deassert each GPIO pin for each flash cycle.
The IPMItool command for reading LED status is:
ipmitool -I lanplus -H IPADDR sunoem led get <sensor ID>
The IPMItool command for setting LED status is:
ipmitool -I lanplus -H IPADDR sunoem led set <sensor ID> <LED mode>
Both of these commands can operate on all sensors at once if you substitute all for the sensor ID. That way, you can easily get a list of all LEDs and their status with one command.
See "LED Sensor IDs" on page 154 and "LED Modes" on page 155 for information about the variables in these commands.

LED Sensor IDs
All LEDs in this server are represented by two sensors:
- A Generic Device Locator record describes the location of the sensor in the system. It has an led suffix and is the name that is fed into the led set and led get commands. You can get a list of all of these sensors by issuing the sdr list generic command.
- A Digital Discrete fault sensor monitors the status of the LED pin and is asserted when the LED is active. These sensors have a fail suffix and are used to report events to the SEL.
Each LED has both a descri
69. Sun Microsystems
Sun Fire X4500/X4540 Servers Diagnostics Guide
Sun Microsystems, Inc.
www.sun.com
Part No. 819-4363-12
July 2009, Revision A
Submit comments about this document by clicking the Feedback link at: http://docs.sun.com

Copyright 2009 Sun Microsystems, Inc., 4150 Network Circle, Santa Clara, California 95054, U.S.A. All rights reserved.
This distribution may include materials developed by third parties.
Sun, Sun Microsystems, the Sun logo, Java, Netra, Solaris, Sun Ray, and Sun Fire X4540 Backup Server are trademarks or registered trademarks of Sun Microsystems, Inc. or its subsidiaries in the U.S. and other countries.
This product is covered and controlled by U.S. Export Control laws and may be subject to the export or import laws in other countries. Nuclear, missile, chemical biological weapons or nuclear maritime end uses or end users, whether direct or indirect, are strictly prohibited. Export or reexport to countries subject to U.S. embargo or to entities identified on U.S. export exclusion lists, including, but not limited to, the denied persons and specially designated nationals lists is strictly prohibited.
Use of any spare or replacement CPUs is limited to repair or one-for-one replacement of CPUs in products exported in compliance with U.S. export laws. Use of CPUs as product upgrades unless authorized by the U.S. Government is strictly prohibited.
Copyright 2009 Sun Microsystems, Inc., 4150
70. S code is shadowed (that is, copied from ROM to DRAM).
2. Once executing out of DRAM, the BIOS performs a simple memory test (a write/read of every location with the pattern 55aa55aa).
Note: This memory test is performed only if Quick Boot is not enabled from the Boot Settings Configuration screen. Enabling Quick Boot causes the BIOS to skip the memory test. See "Changing POST Options" on page 53 for more information.
Note: Because the Sun Fire X4500 server can contain up to 32 GB of memory, the memory test can take several minutes. You can escape from POST testing by pressing any key during POST.
3. The BIOS polls the memory controllers for both correctable and uncorrectable memory errors and logs those errors into the service processor.

Redirecting Console Output
Use the following instructions to access the service processor and redirect the console output so that the BIOS POST codes can be read.
To redirect console output:
1. Initialize the BIOS Setup utility by pressing the F2 key while the system is performing the power-on self-test (POST).
The BIOS Main menu screen is displayed.
2. Select the Advanced menu tab.
The Advanced Settings screen is displayed.
3. Select IPMI 2.0 Configuration.
The IPMI 2.0 Configuration screen is displayed.
4. Select the LAN Configuration menu item.
The LAN Configuration screen is displayed.
5. Determine the server's IP address
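The simple memory test mentioned in the POST sequence above (a write/read of every location with the pattern 55aa55aa) can be illustrated with a small simulation. The following Python sketch is not the BIOS implementation: SimulatedRam, its stuck-bit fault model, and the choice to write every word before reading any back are assumptions made for illustration.

```python
from typing import Callable, Dict, List, Optional

PATTERN = 0x55AA55AA  # the 32-bit test pattern named in the text


class SimulatedRam:
    """Word-addressable memory with optional bits stuck at 0 (hypothetical)."""

    def __init__(self, n_words: int, stuck_low: Optional[Dict[int, int]] = None):
        self.words = [0] * n_words
        self.stuck_low = stuck_low or {}  # word index -> mask of bits stuck at 0

    def write(self, index: int, value: int) -> None:
        mask = self.stuck_low.get(index, 0)
        self.words[index] = value & ~mask & 0xFFFFFFFF

    def read(self, index: int) -> int:
        return self.words[index]


def pattern_test(read_word: Callable[[int], int],
                 write_word: Callable[[int, int], None],
                 n_words: int, pattern: int = PATTERN) -> List[int]:
    """Write the pattern to every word, read it back, return failing indexes."""
    for i in range(n_words):
        write_word(i, pattern)
    return [i for i in range(n_words) if read_word(i) != pattern]


good = SimulatedRam(8)
faulty = SimulatedRam(8, stuck_low={3: 0x00000002})  # bit 1 of word 3 stuck at 0
```

A single fixed pattern only detects faults on bits that the pattern sets to the opposite of the stuck value, which is one reason the guide calls this a simple test.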
71. S kernel errors are errors that relate to running SunVTS, and not to testing of devices. The log file path name is /var/sunvts/logs/vtsk_stderr.txt. This file is not created until SunVTS reports a SunVTS kernel error.
- SunVTS information log contains informative messages that are generated when you start and stop the SunVTS test sessions. The log file path name is /var/sunvts/logs/sunvts.info. This file is not created until a SunVTS test session runs.
- Solaris system message log is a log of all the general Solaris events logged by syslogd. The path name of this log file is /var/adm/messages.

Requirements
To use the Sun Fire X4540 Server Bootable Diagnostics CD, you must have a USB CD-ROM drive, keyboard, mouse, and monitor attached to the server on which you are performing diagnostics.

Using the Bootable Diagnostics CD
To use the Sun Fire X4540 Server Bootable Diagnostics CD to perform diagnostics:
1. Install the USB CD-ROM drive into the Sun Fire X4540 Server.
2. With the server powered on, insert the Sun Fire X4540 Server Bootable Diagnostics CD (705-1439) into the DVD-ROM drive.
3. Reboot the server, but press F2 during the start of reboot so that you can change the BIOS setting for boot-device priority.
4. When the BIOS Main menu appears, navigate to the BIOS Boot menu.
Instructions for navigating within the BIOS screens are printed on the BIOS screens.
5. On the BIOS Boot menu screen, select Boot Device Priority.
The Boo
72. S rather than a single physical LED. TABLE 12-4 describes the LED sensor groups.

TABLE 12-4 LED Sensor Groups
Group Name        Sensors in Group
sys.power.led     bp.power.led, fp.power.led
sys.locate.led    bp.locate.led, fp.locate.led
sys.alert.led     bp.alert.led, fp.alert.led

For example, to set both the front and back panel Power/OK LEDs to a standby blink rate, you could type the following command:
ipmitool -I lanplus -H IPADDR -U root -P changeme sunoem led set sys.power.led standby
Set LED fp.power.led to STANDBY
Set LED bp.power.led to STANDBY
You could turn off the back panel Power/OK LED, but leave the front panel Power/OK LED blinking, by typing the following command:
ipmitool -I lanplus -H IPADDR -U root -P changeme sunoem led set bp.power.led off
Set LED bp.power.led to OFF

Using IPMItool Scripts for Testing
For testing purposes, it is often useful to change the status of all (or at least several) LEDs at once. You can do this by constructing an IPMItool script and executing it with the exec command.
The following example shows a script to turn on all fan module LEDs:
sunoem led set ft0.fm0.led on
sunoem led set ft0.fm1.led on
sunoem led set ft0.fm2.led on
sunoem led set ft1.fm0.led on
sunoem led set ft1.fm1.led on
sunoem led set ft1.fm2.led on
If this script file were then named leds
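A script file like the fan-module example above can also be generated rather than typed by hand, which helps when the same mode must be applied to many LEDs. The following Python sketch assumes the dotted sensor naming ft<tray>.fm<module>.led and two trays of three modules each, as in the LED sensor ID table; the function name and its defaults are illustrative only.

```python
from typing import Iterable


def fan_led_script(mode: str = "on",
                   trays: Iterable[int] = (0, 1),
                   modules: Iterable[int] = (0, 1, 2)) -> str:
    """Build one 'sunoem led set' line per fan module LED."""
    modules = tuple(modules)  # reused for every tray, so materialize it once
    lines = [f"sunoem led set ft{tray}.fm{module}.led {mode}"
             for tray in trays
             for module in modules]
    return "\n".join(lines) + "\n"


# The resulting text can be saved to a file and run with the exec command.
script_text = fan_led_script("on")
```

Calling fan_led_script("off") produces the matching script to switch the same six LEDs back off after a test.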
73. S's low 1MB extraction space, proper boot happens.
ipmitool> sel list
100 | 08/26/2005 | 05:04:04 | OEM #0xfb |
200 | 08/26/2005 | 05:04:09 | Memory | Memory Device Disabled | CPU 0 DIMM 0
- Note the following considerations for this revision:
  - Uncorrectable ECC Memory Error is not reported.
  - Multi-bit ECC errors are reported as Memory Device Disabled.
  - On first reboot, BIOS logs a HyperTransport Error in the DMI log.
  - The BIOS disables the DIMM.
  - The BIOS sends the SEL records to the BMC.
  - The BIOS reboots again.
  - The BIOS skips the faulty DIMM on the next POST memory test.
  - The BIOS reports available memory, excluding the faulty DIMM pair.
FIGURE D-1 shows an example of a DMI log screen from the BIOS Setup page.

FIGURE D-1 DMI Log Screen, Uncorrectable Error
[BIOS SETUP UTILITY screen: Advanced, Event Logging details. Menu items: View Event Log, Mark all events as read, Clear Event Log. Help text: View all unread events on the Event Log. Key legend: Enter Go to Sub Screen, F1 General Help, F10 Save and Exit, ESC Exit. v02.53 (C) Copyright 1985-2002, American Megatrends, Inc.]
View Event Log entry: 09/12/05 11:51:05 A Hyper Transport sync flood error occurred on last boot.

Correctable Errors
This section lists facts and considerations about how the server handles correctable errors.
- During BIOS POST:
  - The BIOS polls the MCK registers
74. failure detected by reading tach signals | Service Action Required and individual fan module LEDs are lit | SP SEL | Fatal
    Single power supply failure | When any of the AC/DC PS_VIN_GOOD or PS_PWR_OK signals are deasserted | Service Action Required and Power Supply Fault LEDs are lit | SP SEL | Non-fatal
    DC/DC power converter failure | Any POWER_GOOD signal is deasserted from the DC/DC converters | The Service Action Required LED is lit, the system is powered down to standby power mode, and the Power LED enters standby blink state | SP SEL | Fatal
    Voltage above/below threshold | The SP monitors system voltages and detects voltage above or below a given threshold | The Service Action Required LED and Power Supply Fault LED blink | SP SEL | Fatal
    High temperature | The SP monitors CPU and system temperatures and detects temperatures above a given threshold | The Service Action Required LED and System Overheat Fault LED blink. The system controller is shut down above the specified critical level | SP SEL | Fatal
    Processor thermal trip | The CPU drives the THERMTRIP_L signal when it detects an overtemp condition | CPLD shuts down power to the CPU. The Service Action Required LED and System Overheat Fault LED blink | SP SEL | Fatal
    Boot device failure | The BIOS is not able to boot from a device in the boot device list | The BIOS goes to the next boot device in the list. If all devices in the list fail, an error message is displayed. Ret | DMI Log | Non-fatal
75. Single-bit DRAM ECC error | With ECC enabled in the BIOS Setup, the CPU detects and corrects a single-bit error on the DIMM interface | The CPU corrects the error in hardware. No interrupt or machine check is generated by the hardware. The polling is triggered every half-second by SMI timer interrupts and is done by the BIOS SMI handler. The BIOS SMI handler starts logging each detected error and stops logging when the limit for the same error is reached. The BIOS's polling can be disabled through a software interface | SP SEL | Normal operation
    Single four-bit DRAM error | With CHIP-KILL enabled in the BIOS Setup, the CPU detects and corrects for the failure of a four-bit-wide DRAM on the DIMM interface | The CPU corrects the error in hardware. No interrupt or machine check is generated by the hardware. The polling is triggered every half-second by SMI timer interrupts and is done by the BIOS SMI handler. The BIOS SMI handler starts logging each detected error and stops logging when the limit for the same error is reached. The BIOS's polling can be disabled through a software interface | SP SEL | Normal operation
    Uncorrectable DRAM ECC error | The CPU detects an uncorrectable multiple-bit DIMM error | The sync flood method of handling this is used to prevent the erroneous data from being propagated across the HyperTransport links. The system reboots, the BIOS recovers the machine check register information, maps this information to the failing DIMM | SP SEL | Fatal
76. This option is for the Sun Fire X4500 server only. It provides a list of Sun Fire X4500 hard drive physical slot numbers, logical names, and status (present or absent). This capability is useful for scripting environments. For example, some applications could include hd -q in noninteractive mode to determine if a specific drive in a specific physical slot is accessible before configuring RAID.
    TABLE 7-1 hd Options (Continued)
    Option | Description
    1 | Lists the Sun Fire X4500 accessible disks in sequential order. This option does not include the physical slot number.
    B | Lists the Sun Fire X4500 bootable slot numbers, Solaris OS logical disk names, and status (present or absent).
    r | Lists the SMART data for all disks in a drive slot number.
    R | Lists the SMART data (individual ID) in landscape view for all disks.
    e | Lists the SMART data for a specified disk <cXtY>.
    j | Lists the Sun Fire X4500 server HBA controller numbers and PCI nodes.
    Example: Using the hd Utility
    The following command starts the utility in color mode and summarizes all the storage devices in the system:
    hd -c -s
    Here is an example of output listing a summary of all storage devices.
    FIGURE 7-4 hd Utility Summary
    platform Sun Fire X4500 Server
    Device Serial Vendor Model Revision Temperature
    c0t0d0s2 K41BT4C7M6
77. able Diagnostics CD (705-1439). This CD is designed so that the server will boot from the CD. This CD boots the Solaris operating system and starts SunVTS software. Diagnostic tests run and write output to log files that the service technician can use to determine the problem with the server.
    Requirements
    To use the Sun Fire X4500 Server Bootable Diagnostics CD, you must have a keyboard, mouse, and monitor attached to the server on which you are performing diagnostics.
    Using the Bootable Diagnostics CD
    To use the Sun Fire X4500 Server Bootable Diagnostics CD to perform diagnostics:
    1. With the server powered on, insert the Sun Fire X4500 Server Bootable Diagnostics CD (705-1439) into the DVD-ROM drive.
    2. Reboot the server, but press F2 during the start of reboot so that you can change the BIOS setting for boot device priority.
    3. When the BIOS Main menu appears, navigate to the BIOS Boot menu. Instructions for navigating within the BIOS screens are printed on the BIOS screens.
    4. On the BIOS Boot menu screen, select Boot Device Priority. The Boot Device Priority screen appears.
    5. Select the DVD-ROM drive to be the primary boot device.
    6. Save and exit the BIOS screens.
    7. Reboot the server. When the server reboots from the CD in the DVD-ROM drive, the Solaris Operating System boots and SunVTS software starts and opens its first GUI window.
    8. In the SunVTS GUI
78. ad Sensors. For more information about supported IPMI 2.0 commands and the sensor naming for this server, refer to the Integrated Lights Out Manager Administration Guide.
    Reading Sensor Status
    You can read sensor status at levels ranging from a broad overview that lists all sensors to queries of individual sensors that return detailed information. For information on the physical locations of the sensors in the system, see "Sun Fire X4500 Sensor Locations" on page 87.
    Reading All Sensors
    To view a list of all sensors in the server and their status, use the sdr list command with no arguments. This command returns a large table that includes every sensor in the server and its status. The five fields of the output lines, as read from left to right, are:
    1. IPMI sensor ID (16-character maximum)
    2. IPMI sensor number
    3. Sensor status, indicating thresholds that have been exceeded
    4. Entity ID and instance
    5. Sensor reading
    For example (TABLE 4-1):
    fp.t_amb | 0Ah | ok | 12.0 | 22 degrees C
    Reading Specific Sensors
    You can refine the output to see only specific sensors by using the sdr list command with an optional argument that limits the output to sensors of a specific type. The default output is a long list of sensors. TABLE 4-2 describes the available sensor arguments.
    TABLE 4-2 IPMItool Sensor Arguments
    Argument | Description | Sensors
    all | All sensor records | All sensor
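The five-field layout described in this excerpt can be split mechanically. The sketch below is illustrative only: it assumes the fields are separated by a vertical bar, as IPMItool prints them (the extracted text lost the separators), and that the sensor number uses an "h" hex suffix. The names SdrRecord and parse_sdr_line are hypothetical.

```python
from typing import NamedTuple


class SdrRecord(NamedTuple):
    sensor_id: str      # field 1: IPMI sensor ID
    sensor_number: int  # field 2: IPMI sensor number, parsed from hex
    status: str         # field 3: sensor status
    entity: str         # field 4: entity ID and instance, e.g. "12.0"
    reading: str        # field 5: sensor reading


def parse_sdr_line(line):
    """Split one five-field sensor line into an SdrRecord.

    Assumption: fields are separated by '|' and the sensor number
    is hexadecimal with a trailing 'h' (for example '0Ah').
    """
    fields = [field.strip() for field in line.split("|")]
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}: {line!r}")
    sensor_id, number, status, entity, reading = fields
    return SdrRecord(sensor_id, int(number.rstrip("hH"), 16), status, entity, reading)


if __name__ == "__main__":
    print(parse_sdr_line("fp.t_amb | 0Ah | ok | 12.0 | 22 degrees C"))
```

Keeping the reading as text avoids guessing at units; a caller that only wants temperatures could split the reading on the first space.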
79. al 00:03:BA:D8:73:AC
    Product Extra: 01
    Product Extra: 00:03:BA:D8:73:AC
    Viewing and Setting Status LEDs
    In the Sun Fire X4500/X4540 servers, all LEDs are activity-oriented. In activity-oriented LEDs, the SP is responsible for the I2C commands that assert and deassert each GPIO pin for each flash cycle.
    Use the following IPMItool command to read LED status:
    ipmitool -I lanplus -H <IPADDR> sunoem led get <sensor ID>
    Use the following IPMItool command to set LED status:
    ipmitool -I lanplus -H <IPADDR> sunoem led set <sensor ID> <LED mode>
    Both of these commands can operate on all sensors at once by substituting all for the sensor ID. That way, you can easily get a list of all LEDs and their status with one command. See "LED Sensor IDs" on page 42 and "LED Modes" on page 44 for information about the variables in these commands.
    LED Sensor IDs
    All LEDs in this server are represented by two sensors:
    - A Generic Device Locator record describes the location of the sensor in the system. It has a .led suffix and is the name that is fed into the led set and led get commands. You can get a list of all of these sensors by issuing the sdr list generic command.
    - A Digital Discrete fault sensor monitors the status of the LED pin and is asserted when the LED is active. These sensors have a .fail suffix and are used to report events to the SEL.
80. and indicators. TABLE 8-1 describes the controls and indicators.
    FIGURE 8-2 Sun Fire X4540 Server Front Panel LEDs
    TABLE 8-1 Front Panel Controls and Indicators
    Name | Color | Description
    Locate button/LED | White | Operators can turn this LED on remotely to help them locate the server in a crowded server room. Press to turn off. Pressing the Locate LED Switch for five seconds turns all indicators ON for 15 seconds.
    System Fault | White | On: When service action is required.
    Power/Operation | Green | Steady: Power is On. Blink: Standby power is On but main power is Off. Off: Power is Off.
    System power button | Grey | To power on main power for all the server components.
    Top failure LED | Amber | On: HDD or fan fault.
    Rear failure LED | Amber | On: Power supply or system controller fault (service is required).
    Over Temperature LED | Amber | On: When system is over temperature.
    2. Inspect the back panel LEDs for indications of component malfunction. FIGURE 8-3 shows the rear panel features. TABLE 8-2 describes each feature.
    FIGURE 8-3 Sun Fire X4540 Server Rear Panel LEDs
    TABLE 8-2 Rear Panel Features
    Name | Description
    1 AC power connectors | Verify that the PS LEDs are green. Each power supply has its own AC connector with a clip to secure its power cable
81. aring CPU for booting to OS by copying all of the context of the BSP to all application processors present. NOTE: APs are left in the CLI HLT state.
    8613 | Initialize PM regs and PM PCI regs at Early-POST. Initialize multi-host bridge, if system supports it. Set up ECC options before memory clearing. Enable PCI-X clock lines in the 8131.
    0024 | Uncompress and initialize any platform-specific BIOS modules.
    862a | BBS ROM initialization.
    002a | Generic Device Initialization Manager (DIM): Disable all devices.
    042a | ISA PnP devices: Disable all devices.
    052a | PCI devices: Disable all devices.
    122a | ISA devices: Static device initialization.
    152a | PCI devices: Static device initialization.
    252a | PCI devices: Output device initialization.
    202c | Initializing different devices. Detecting and initializing the video adapter installed in the system that has optional ROMs.
    002e | Initializing all the output devices.
    0033 | Initializing the silent boot module. Set the window for displaying text information.
    0037 | Displaying sign-on message, CPU information, setup key message, and any OEM-specific information.
    4538 | PCI devices: IPL device initialization.
    5538 | PCI devices: General device initialization.
    8600 | Preparing CPU for booting to OS by copying all of the context of the BSP to all application processors present. NOTE: APs are left in the CLI HLT state.
    POST Code Checkpoints
    The POST code checkpoints are the largest set of checkpoints during the BI
82. ate LED Switch for five seconds turns all indicators ON for 15 seconds.
    2 System Fault | White | On: When service action is required.
    3 Power/Operation | Green | Steady: Power is On. Blink: Standby power is On but main power is Off. Off: Power is Off.
    4 System power button | Grey | Press to power on main power for all server components.
    5 Top failure LED | Amber | On: HDD or fan fault.
    6 Rear failure LED | Amber | On: Power supply or system controller fault (service is required).
    7 Over Temperature LED | Amber | On: When system is over temperature.
    Rear Panel
    FIGURE 6-3 shows the features of the rear panel. TABLE 6-2 lists and describes each feature.
    FIGURE 6-3 Sun Fire X4500 Server Rear Panel
    Internal Status Indicator LEDs
    The Sun Fire X4500 server has internal status board LEDs for the CPU board, the CPU, and DIMM slots on the CPU board. See the following figures and tables for information about the LEDs that are viewable on the outside of the server:
    - TABLE 6-2 and TABLE 6-3 describe the internal LEDs.
    - FIGURE 6-4 shows the disk drive and fan tray LEDs.
    - TABLE 6-2 describes the disk drive and fan tray LEDs.
    - FIGURE 6-6 shows the LED and button location.
    The system includes internal LEDs on the disk drives, the fan trays, and the PCI slots
83. ault LEDs 173 Rear Panel Features FIGURE 14 3 shows all the features of the rear panel TABLE 14 2 describes each rear panel feature FIGURE 14 3 Sun Fire X4540 Server Rear Panel TABLE 14 2 Rear Panel Features Name Description 1 AC power connectors Verify that the PS LEDs are green Each power supply has its own AC connector with a clip to secure its power cable 2 Chassis ground Connect grounding straps here 3 0 PCI e 1 PCI e 2 PCI e Slots for three PCI e cards 4 Locate button LED White Operators can turn this LED On remotely to help then locate the server in a crowded server room Press to turn off 5 Fault LED Amber When on service action required A Steady Power is On Off Power is Off 174 Book Title without trademarks or an abbreviated book title July 2009 TABLE 14 2 Rear Panel Features Name Description 6 OK LED Green Service action allowed When On service action is required Blink Standby power is On but main power is Off 7 SVC Service buttons SP Reset Service Processor 10 11 12 13 14 SC System controller status LEDs SER MGT NET MGT S 10 100 1000 USB connectors Video connector Compact flash CF card NMI Non Maskable Interrupt dump Sends an NMI to the CPU Used for debugging only Host Reset Host Bus Adapter Do not use these buttons unless instructed by Sun service personnel To operate these buttons insert a st
84. being executed. Main BIOS checksum is tested.
    01d7 | Restoring CPUID; moving bootblock-runtime interface module to RAM; determine whether to execute serial flash.
    01d8 | Uncompressing runtime module into RAM. Storing CPUID information in memory.
    01d9 | Copying main BIOS into memory.
    01da | Giving control to BIOS POST.
    0004 | Check CMOS diagnostic byte to determine if battery power is OK and CMOS checksum is OK. If the CMOS checksum is bad, update CMOS with power-on default values.
    00c2 | Set up boot strap processor for POST. This includes frequency calculation, loading BSP microcode, and applying user-requested value for GART Error Reporting setup question.
    00c3 | Errata workarounds applied to the BSP (#78 & #110).
    00c6 | Re-enable cache for boot strap processor, and apply workarounds in the BSP for errata #106, #107, #69, and #63 if appropriate.
    00c7 | HT sets link frequencies and widths to their final values.
    000a | Initializing the 8042-compatible Keyboard Controller.
    000c | Detecting the presence of Keyboard in KBC port.
    000e | Testing and initialization of different Input Devices. Traps the INT09h vector, so that the POST INT09h handler gets control for IRQ1.
    TABLE 13-1 POST Codes (Continued)
    Post Code | Description
    8600 | Preparing CPU for booting to OS by copying all of the context of the BSP to all application processors present. NOTE: APs are left in the CLI HLT state.
    de00 | Prep
85. ce, including all processors' Machine Check Error registers (events 14 to 18).
    1f | After BIOS detected that a UCE had occurred, it located the DIMM and reset. 0x03 refers to reboot count.
    21 to 25 | BIOS off-lined faulty DIMMs from system memory space and reported them. Each DIMM of a pair is being reported, since hardware UCE evidence cannot lead BIOS any further than detection of a faulty pair.
    Correctable DIMM Errors
    If a DIMM has 24 or more correctable errors in 24 hours, it is considered defective and should be replaced. At this time, CEs are not logged in the server's system event logs. They are reported or handled in the supported operating systems as follows:
    - Windows Server:
      a. A Machine Check error message bubble appears on the task bar.
      b. The user must manually open Event Viewer to view errors. Access Event Viewer through this menu path: Start > Administration Tools > Event Viewer
      c. The user can then view individual errors (by time) to see details of the error.
    - Solaris: Solaris FMA reports and (sometimes) retires memory with correctable Error Correction Code (ECC) errors. See your Solaris Operating System documentation for details. Use the command fmdump -eV to view ECC errors.
    - Linux: The HERD utility can be used to manage DIMM errors in Linux. See the x64 Servers Utilities Reference Manual for details.
      a. If HERD is installed, it copies messages f
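The replacement rule in this excerpt (24 or more correctable errors in 24 hours) can be checked against a list of error timestamps with a sliding window. This is a sketch under stated assumptions, not a tool from the guide: it assumes "in 24 hours" means any span of at most 24 hours, that the caller has already collected one timestamp per correctable error for a single DIMM, and the names CE_LIMIT, WINDOW, and dimm_is_defective are hypothetical.

```python
from datetime import datetime, timedelta

CE_LIMIT = 24                 # correctable errors that mark a DIMM as defective
WINDOW = timedelta(hours=24)  # span in which those errors must fall


def dimm_is_defective(error_times):
    """Return True if any 24-hour span holds CE_LIMIT or more errors.

    `error_times` is an iterable of datetime objects, one per
    correctable error reported for a single DIMM.
    """
    times = sorted(error_times)
    start = 0
    for end, current in enumerate(times):
        # Drop errors that are more than 24 hours older than the current one.
        while current - times[start] > WINDOW:
            start += 1
        if end - start + 1 >= CE_LIMIT:
            return True
    return False


if __name__ == "__main__":
    base = datetime(2009, 7, 1)
    hourly = [base + timedelta(hours=i) for i in range(24)]
    print(dimm_is_defective(hourly))
```

The timestamps would have to come from whatever the operating system reports, since the excerpt states that CEs are not logged in the server's system event logs.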
86. ce processor clock settings. If the clock settings change, the change is reflected in the time stamps. When the service processor reboots, the SP clock is set to Thu Jan 1 00:00:00 UTC 1970. The SP reboots as a result of the following:
    - A complete system unplug/replug power cycle
    - An IPMI command, for example, mc reset cold
    - A command-line interface (CLI) command, for example, reset /SP
    - ILOM web GUI operation, for example, from the Maintenance tab, selecting Reset SP
    - An SP firmware upgrade
    After an SP reboot, the SP clock is changed by the following:
    - When the host is booted. The host's BIOS unconditionally sets the SP time to that indicated by the host's RTC. The host's RTC is set by the following operations:
      - When the host's CMOS is cleared as a result of changing the host's RTC battery or inserting the CMOS-clear jumper on the system controller. The host's RTC starts at Jan 1 00:01:00 2002.
      - When the host's operating system sets the host's RTC. The BIOS does not consider time zones. Solaris and Linux software respect time zones and will set the system clock to UTC. Therefore, after the OS adjusts the RTC, the time set by the BIOS will be UTC.
      - When the user sets the RTC using the host BIOS Setup screen.
    - Continuously via NTP, if NTP is enabled on the SP. NTP jumping is enabled to recover quickly from an erroneous update from the BIOS or user. NTP servers provide UTC time. Therefore, if NTP is enabled on the SP, the SP cl
87. configuration allows for out-of-band server management, and second, the operating system is not burdened with transporting system status data. Your Sun Fire X4500 Service Processor (SP) is IPMI v2.0 compliant. You can access IPMI functionality through the command line with the IPMItool utility, either in-band or out-of-band. Additionally, you can generate an IPMI-specific trap from the web interface, or manage the server's IPMI functions from any external management solution that is IPMI v1.5 or v2.0 compliant. For more information about the IPMI v2.0 specification, go to: http://www.intel.com/design/servers/ipmi/spec.htm#spec2
    About IPMItool
    IPMItool is included on the Sun Fire X4500 Server Tools and Drivers CD (705-1438). IPMItool is a simple command-line interface that is useful for managing IPMI-enabled devices. You can use this utility to perform IPMI functions with a kernel device driver or over a LAN interface. IPMItool enables you to manage system hardware components, monitor system health, and monitor and manage system environmentals, independent of the operating system. Locate IPMItool and its related documentation on your Sun Fire X4500 Server Tools and Drivers CD, or download this tool at: http://ipmitool.sourceforge.net
    IPMItool Man Page
    After you install the IPMItool package, you can access detailed information about command usage and syntax from the man page that is installed. From a command line, type the following c
88. d ILOM 2.0.2.5 and Later 81
    Pre-ILOM 2.0.2.5 and No USB Devices 83
    Pre-ILOM 2.0.2.5 and One USB Device 83
    ILOM 2.0.2.5 or Later and No USB Device 84
    ILOM 2.0.2.5 or Later and One USB Device 84
    ILOM 2.0.2.5 or Later and Three USB Storage Devices 86
    A. Sun Fire X4500 Sensor Locations 87
    B. Error Handling 91
    Handling of Uncorrectable Errors 91
    Handling of Correctable Errors 94
    Handling of Parity Errors (PERR) 96
    Handling of System Errors (SERR) 99
    Handling Mismatching Processors 101
    Hardware Error Handling Summary 102
    Part II Sun Fire X4540 Server Diagnostics Guide
    8. Initial Inspection of the Server 109
    Service Visit Troubleshooting Flowchart 109
    Gathering Service Visit Information 111
    Troubleshooting Power Problems 111
    External Inspection of the Server 112
    Internal Inspection of the Server 115
    9. Using SunVTS Diagnostic Software 119
    About SunVTS Diagnostic Software 119
    Accessing SunVTS 120
    SunVTS Documentation 120
    Running SunVTS Diagnostic Tests 120
    Using the Bootable Diagnostics CD 120
    SunVTS Log Files 120
    Requirements 121
    Using the Bootable Diagnostics CD 121
    Reviewing SunVTS Log Files 122
    10. Troubleshooting DIMM Problems 123
    DIMM Population Rules 123
    Supported DIMM Configurations 124
    DIMM Replacement Policy 124
    How DIMM Errors Are Handled by the System 125
    Uncorrectable DIMM Errors 125
    Correctable DIMM Errors 126
    BIOS DIMM Error Messages 127
    DIMM Fault LEDs 128
    I
89. dictive Failure Deasserted
    ft0.fm2.led | 00h | ns | 29.2 | Generic Device @20h:19h.2
    ft1.fm0.fail | 40h | ok | 29.3 | Predictive Failure Deasserted
    ft1.fm0.led | 00h | ns | 29.3 | Generic Device @20h:19h.3
    ft1.fm1.fail | 41h | ok | 29.4 | Predictive Failure Deasserted
    ft1.fm1.led | 00h | ns | 29.4 | Generic Device @20h:19h.4
    ft1.fm2.fail | 42h | ok | 29.5 | Predictive Failure Deasserted
    ft1.fm2.led | 00h | ns | 29.5 | Generic Device @20h:19h.5
    ft0.fm0.f0.speed | 43h | ok | 29.0 | 6000 RPM
    ft0.fm1.f0.speed | 44h | ok | 29.1 | 6000 RPM
    ft0.fm2.f0.speed | 45h | ok | 29.2 | 6000 RPM
    ft1.fm0.f0.speed | 46h | ok | 29.3 | 6000 RPM
    ft1.fm1.f0.speed | 47h | ok | 29.4 | 6000 RPM
    ft1.fm2.f0.speed | 48h | ok | 29.5 | 6000 RPM
    Other queries can include a particular type of sensor. The command in the following example returns a list of all Temperature-type sensors in the SDR:
    ipmitool -I lanplus -H <IPADDR> -U root -P changeme sdr type temperature
    sys.tempfail | 03h | ok | 23.0 | Predictive Failure Deasserted
    mb.t_amb | 05h | ok | 7.0 | 25 degrees C
    fp.t_amb | 14h | ok | 12.0 | 25 degrees C
    ps.t_amb | 1Bh | ok | 10.0 | 24 degrees C
    io.t_amb | 22h | ok | 15.0 | 23 degrees C
    p0.t_core | 2Ch | ok | 3.0 | 35 degrees C
    p1.t_core | 35h | ok | 3.1 | 36 degrees C
    Using IPMItool to View the ILOM SP System Event Log
    The ILOM SP System Event Log (SEL) provides storage of all system events. You can view the SEL with IPM
90. ds_fan_on.isc, you would use it in a command as follows:
    ipmitool -I lanplus -H <IPADDR> -U root -P changeme exec leds_fan_on.isc
    CHAPTER 5 Event Logs and POST Codes
    This chapter contains information about the BIOS event log, the BMC system event log, the power-on self-test (POST), and console redirection. For more information on the BIOS event log and POST codes, refer to the Sun Fire X4500 Server Service Manual (819-4359). This chapter includes the following sections:
    - "Viewing Event Logs" on page 47
    - "Power-On Self-Test (POST)" on page 50
    - "How BIOS POST Memory Testing Works" on page 51
    - "Redirecting Console Output" on page 51
    - "Changing POST Options" on page 53
    - "POST Codes" on page 55
    - "POST Code Checkpoints" on page 57
    Viewing Event Logs
    Use this procedure to view the BIOS event log and the BMC system event log.
    1. To turn on main power mode (all components powered on), use a ball-point pen or stylus to press and release the Power button on the server front panel. See FIGURE 8-4. When main power is applied to the full server, the Power/OK LED next to the Power button lights and remains lit.
    2. Enter the BIOS Setup utility by pressing the F2 key while the system is performing the power-on self-test (POST). The BIOS Main menu screen is displayed.
    3. View the BIOS event log.
       a. From the BIOS Main Menu sc
91. e 4, "Troubleshooting DIMM Problems" on page 6
    Service Visit Troubleshooting Flowchart
    Use the following flowchart as a guideline for using the subjects in this book to troubleshoot the server.
    FIGURE 1-1 Troubleshooting Flowchart
    To perform this task (in order):
    - Gather initial service visit information.
    - Investigate any powering-on problems.
    - Perform external visual inspection and internal visual inspection.
    - View BIOS event logs and POST messages.
    - View service processor logs and sensor information.
    - Run SunVTS diagnostics.
    Refer to these sections:
    - "Gathering Service Visit Information" on page 2
    - "Initial Inspection of the Server" on page 1
    - "Externally Inspecting the Server" on page 4
    - "Internally Inspecting the Server" on page 4
    - "Troubleshooting DIMM Problems" on page 6
    - "Viewing Event Logs" on page 15
    - "About Power-On Self-Test (POST)" on page 162
    - "Using the ILOM Service Processor GUI to View System Information" on page 19
    - "Using IPMItool to View System Information" on page 31
    - "Diagnosing Server Problems With the Bootable Diagnostics CD" on page 16
    Gathering Service Visit Information
    The first step in determining the cause of the problem with the server is to gather whatever information you can from the service-call paperwork or the onsite personnel. Use the following general guideline steps when you begi
92. e CD. Therefore, the test log files are not on the server's hard disk drive, and they will be deleted when you power cycle the server.
    CHAPTER 10 Troubleshooting DIMM Problems
    This chapter describes how to detect and correct problems with the Sun Fire X4500/X4540 servers' Dual Inline Memory Modules (DIMMs). It includes the following sections:
    - "DIMM Population Rules" on page 123
    - "Supported DIMM Configurations" on page 124
    - "DIMM Replacement Policy" on page 124
    - "How DIMM Errors Are Handled by the System" on page 125
    - "Isolating and Correcting DIMM ECC Errors" on page 130
    DIMM Population Rules
    The DIMM population rules for the server are as follows:
    - Each CPU can support a maximum of eight DIMMs.
    - The DIMM slots are paired, and the DIMMs must be installed in pairs (0-1, 2-3, 4-5, and 6-7). See FIGURE 10-1. The memory sockets are colored black or white to indicate which slots are paired by matching colors.
    - DIMMs are populated starting from the outside (away from the CPU) and working toward the inside.
    - CPUs with only a single pair of DIMMs must have those DIMMs installed in that CPU's outside white DIMM slots (6 and 7). See FIGURE 10-1.
    - Only DDR2 800 MHz, 667 MHz, and 533 MHz DIMMs are supported.
    - Each pair of DIMMs must be identical (same manufacturer, size, and speed).
    Supported DIMM C
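The pairing and ordering rules in this excerpt can be expressed as a small check on the set of populated slots for one CPU. The sketch below is illustrative and makes one labeled assumption: "outside to inside" is taken to mean the pair order 6-7, 4-5, 2-3, 0-1, which follows from the rule that a single pair goes in slots 6 and 7 but is not spelled out in the text. It does not check DIMM type or whether the two DIMMs in a pair are identical. The names OUTSIDE_TO_INSIDE and check_population are hypothetical.

```python
# Assumed fill order, outermost pair first (inferred from the rule that a
# single pair must occupy slots 6 and 7).
OUTSIDE_TO_INSIDE = [(6, 7), (4, 5), (2, 3), (0, 1)]


def check_population(populated_slots):
    """Return a list of rule violations for one CPU's populated DIMM slots."""
    slots = set(populated_slots)
    problems = []
    if not slots <= set(range(8)):
        problems.append("slot numbers must be in the range 0 to 7")
    seen_empty_pair = False
    for first, second in OUTSIDE_TO_INSIDE:
        present = {first, second} & slots
        if len(present) == 1:
            problems.append(f"slots {first}-{second}: DIMMs must be installed in pairs")
        elif len(present) == 2 and seen_empty_pair:
            problems.append(f"slots {first}-{second}: an outer pair is empty")
        elif not present:
            seen_empty_pair = True
    return problems


if __name__ == "__main__":
    print(check_population([6, 7]))        # single pair in the outside slots
    print(check_population([0, 1]))        # single pair in the wrong slots
```

An empty result means none of the checked rules were violated; it is not a statement that the configuration is supported.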
93. e Devices
    Auto-Detecting Pri Master: ATAPI CDROM
    Pri Master: DU 28SL 1.0A, Ultra DMA Mode 2
    Auto-detecting USB Mass Storage Devices
    Device 01: AMI Virtual CDROM
    Device 02: AMI Virtual Floppy
    2 USB mass storage devices found and configured.
    0085 BMC Responding
    A Hyper Transport sync flood error occurred on last boot
    PCI System Error
    - SERR and HyperTransport Sync Flood Error are logged in DMI and the SP SEL. See the following sample output:
    SEL Record ID: 0a00
    Record Type: 00
    Timestamp: 08/10/2005 06:05:32
    Generator ID: 0001
    EvM Revision: 04
    Sensor Type: Critical Interrupt
    Sensor Number: 00
    Event Type: Sensor-specific Discrete
    Event Direction: Assertion Event
    Event Data: 05ffff
    Description: PCI SERR
    - FIGURE D-6 shows an example DMI log screen from the BIOS Setup page with a system error.
    FIGURE D-6 DMI Log Screen, System Error Listed (BIOS Setup Utility, View Event Log screen, showing a Hyper Transport sync flood error entry dated 09/12/05 14:23:47 and a second error entry)
    Handling Mismatched Processors
    This section lists facts and considerations about how the server handles mismatching processors.
    - The BIOS performs a complete POST.
    - The BIOS displays a report of any mismatching CPUs, as shown in the following example
94. e SP. NTP jumping is enabled to recover quickly from an erroneous update from the BIOS or user. NTP servers provide UTC time. Therefore, if NTP is enabled on the SP, the SP clock will be in UTC.
    - Via the CLI, ILOM web GUI, and IPMI.
    Viewing Replaceable Component Information
    Depending on the component you select, information about the manufacturer, component name, serial number, and part number can be displayed.
    1. Log in to the SP as Administrator or Operator to reach the ILOM web GUI:
       a. Type the IP address of the server's SP into your web browser. The Sun Integrated Lights Out Manager Login screen is displayed.
       b. Type your user name and password. When you first try to access the ILOM Service Processor, you are prompted to type the default user name and password:
          Default user name: root
          Default password: changeme
    2. From the System Information tab, select Components. The Replaceable Component Information page is displayed. See FIGURE 11-2.
    FIGURE 11-2 Replaceable Component Information Page (ILOM web GUI, Components tab)
95. e X4500 Server supports 48 internal SATA drives. A physical map of these drives is located on the Sun Fire X4500 Server chassis label. The hd utility is included in the SUNWhd package and is preinstalled on your server. The hd utility is a hard disk drive utility for x86 systems such as the Sun Fire X4500 Server. It is used to determine the logical-to-physical device mapping of your Sun Fire X4500 Server. You need to understand this mapping to administer the system, manage the hard drives, and troubleshoot the server. The hd utility output enables you to locate all the disks visually, based on the physical topology of the Sun Fire X4500 server drives, by providing a color-coded hard drive location map. The utility's output gives you a what-you-see-is-what-you-get (WYSIWYG) physical location map of the Sun Fire X4500 server's drives. The hd utility provides the following features:
    - Displays of all the available storage devices on the system
    - Color-coded hard drive location maps
    - Remote analysis
    This utility has a run-time color mode to help you distinguish the status of a hard drive. It is a complementary tool to Solaris OS disk maintenance and configuration administration programs like format(1M) and cfgadm(1M). The hd utility output can also help you to identify which drives have not been enumerated. FIGURE 7-1 shows the Sun Fire X4500 server drive layout.
    FIGURE 7-1 Server Disk Drive and Fan Tray Layout
96. e an SDR cache for later use, type the sdr dump command. For example:
    ipmitool -I lanplus -H <IPADDR> -U root -P changeme sdr dump galaxy.sdr
    Dumping Sensor Data Repository to 'galaxy.sdr'
    After you have generated a cache file, it can be supplied to future invocations of IPMItool with the -S option. For example:
    ipmitool -I lanplus -H <IPADDR> -U root -P changeme -S galaxy.sdr sel elist
    100 | Pre-Init Time-stamp | Entity Presence ps1.prsnt | Device Absent
    200 | Pre-Init Time-stamp | Entity Presence io.f0.prsnt | Device Absent
    300 | Pre-Init Time-stamp | Power Supply ps0.vinok | State Asserted
    Sensor Numbers and Sensor Names in SEL Events
    Depending on which IPMI command you use, the sensor number that is displayed for an event might appear in slightly different formats. See the following examples:
    - The sensor number for the sensor ps1.prsnt (power supply 1 present) can be displayed as either 1Fh or 0x1F.
    - 38h is equivalent to 0x38.
    - 4Bh is equivalent to 0x4B.
    The output from certain commands might not display the sensor name along with the corresponding sensor number. To see all sensor names in your server mapped to the corresponding sensor numbers, you can use the following command:
    ipmitool -H 129.144.82.21 -U root -P changeme sdr elist
    sys.id | 00h | ok | 23.0 | State Asserted
    sys.intsw | 01h | ok | 23.0 |
    sys.psfail | 02h | ok | 23.0 | Predictive Fa
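The two notations for sensor numbers in this excerpt (1Fh and 0x1F) are the same hexadecimal value written with a suffix or a prefix. A minimal sketch of converting between them follows; the function names are hypothetical and nothing here queries a service processor.

```python
def parse_sensor_number(text):
    """Parse a sensor number written as '1Fh' or '0x1F' into an integer."""
    value = text.strip()
    if value.lower().startswith("0x"):
        return int(value[2:], 16)
    if value.lower().endswith("h"):
        return int(value[:-1], 16)
    raise ValueError(f"unrecognized sensor number format: {text!r}")


def as_suffix_form(number):
    """Format a sensor number with the 'h' suffix, for example '1Fh'."""
    return f"{number:02X}h"


def as_prefix_form(number):
    """Format a sensor number with the '0x' prefix, for example '0x1F'."""
    return f"0x{number:02X}"


if __name__ == "__main__":
    number = parse_sensor_number("1Fh")
    print(number, as_suffix_form(number), as_prefix_form(number))
```

Normalizing both forms to an integer makes it straightforward to match an event's sensor number against the name listed by sdr elist.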
97. e as follows:
    - Force BIOS: Remove the Sun logo and display Option ROM output.
    - Keep Current: Do not remove the Sun logo. The Option ROM output is not displayed.
    - Boot Num Lock: This option is On by default (keyboard Num Lock is turned on during boot). If you set this to Off, the keyboard Num Lock is not turned on during boot.
    - Wait for F1 if Error: This option is disabled by default. If you enable this, the system will pause if an error is found during POST and will only resume when you press the F1 key.
    - Interrupt 19 Capture: This option is reserved for future use. Do not change.
    - Default Boot Order: The letters in the brackets represent the boot devices. To see the letters defined, position your cursor over the field and read the definition in the right side of the screen.
    POST Codes
    TABLE 5-1 contains descriptions of each of the POST codes, listed in the same order in which they are generated. These POST codes appear as a four-digit string that is a combination of two-digit output from primary I/O port 80 and two-digit output from secondary I/O port 81. In the POST codes listed in TABLE 5-1, the first two digits are from port 81 and the last two digits are from port 80.
    TABLE 5-1 POST Codes
    Post Code | Description
    00d0 | After Power-On Reset (POR), PCI configuration space initialization, Enabling 8111's SMBus.
    00d1 | Keyboard controller BAT, Wakin
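The four-digit POST code layout described in this excerpt (first two digits from port 81, last two from port 80) can be illustrated with a short helper. This is a sketch for reading codes out of a log or table, not a way to read the I/O ports themselves; the function name split_post_code is hypothetical.

```python
import string


def split_post_code(code):
    """Split a four-digit POST code into (port 81 byte, port 80 byte).

    Per the description above, the first two hex digits come from
    secondary I/O port 81 and the last two from primary I/O port 80.
    """
    text = code.strip()
    if len(text) != 4 or any(ch not in string.hexdigits for ch in text):
        raise ValueError(f"expected four hex digits, got {code!r}")
    return int(text[:2], 16), int(text[2:], 16)


if __name__ == "__main__":
    port81, port80 = split_post_code("00d0")
    print(f"port 81: {port81:02x}, port 80: {port80:02x}")
```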
98. ed in the system. Also check for DEL or ESC keys to limit memory test. Display total memory in the system.
3C  By this point, RAM read/write test is completed. Program memory holes or handle any adjustments needed in RAM size with respect to NB. Test if HT Module found an error in BootBlock and CPU compatibility for MP environment.
40  Detect different devices (parallel ports, serial ports, and coprocessor in CPU, etc.) successfully installed in the system and update the BDA, EBDA, etc.
50  Programming the memory hole or any kind of implementation that needs an adjustment in system RAM size, if needed.
52  Updates CMOS memory size from memory found in memory test. Allocates memory for Extended BIOS Data Area from base memory.
60  Initializes NUM-LOCK status and programs the KBD typematic rate.
75  Initializes Int-13 and prepares for IPL detection.
78  Initializes IPL devices controlled by BIOS and option ROMs.
7A  Initializes remaining option ROMs.
7C  Generate and write contents of ESCD in NVRam.
84  Log errors encountered during POST.
85  Display errors to the user and get the user response for error.
87  Execute BIOS setup if needed/requested.
8C  After all device initialization is done, program any user-selectable parameters relating to NB/SB, such as timing parameters, non-cacheable regions, and the shadow RAM cacheabil
99. efits. First, this configuration allows for out-of-band server management. Second, the operating system is not burdened with transporting system status data.
Your Sun Fire X4540 Service Processor (SP) is IPMI v2.0 compliant. You can access IPMI functionality through the command line with the IPMItool utility, either in-band or out-of-band. Additionally, you can generate an IPMI-specific trap from the web interface, or manage the server's IPMI functions from any external management solution that is IPMI v1.5 or v2.0 compliant. For more information about the IPMI v2.0 specification, go to: http://www.intel.com/design/servers/ipmi/spec.htm#spec2
About IPMItool
IPMItool is a simple command-line interface used to manage IPMI-enabled devices. You can use this utility to perform IPMI functions with a kernel device driver or over a LAN interface. IPMItool allows you to manage system hardware components, monitor system health, and monitor and manage system environmentals, independent of the operating system.
IPMItool is included on the Sun Fire X4540 Server Tools and Drivers CD (705-1438). Locate IPMItool and its related documentation on that CD, or download the tool at: http://ipmitool.sourceforge.net
IPMItool Man Page
After you install the IPMItool package, you can access detailed information about command usage and syntax from the man page that is installed. From a command line, type the follow
100. egory: All Sensors
Sensor Readings (77 sensors)
Status                         Name            Reading
State Asserted                 sys.id
State Asserted                 sys.intsw
Predictive Failure Deasserted  sys.psfail
Predictive Failure Deasserted  sys.tempfail
Predictive Failure Deasserted  sys.fanfail
Normal                         mb.t_amb        24 degrees C
Normal                         mb.v_bat        3.232 Volts
Normal                         mb.v_+3v3stby   3.217 Volts
Unknown                        mb.v_+3v3       Not Available
Unknown                        mb.v_+5v        Not Available
3. Select the type of sensor readings that you want to view from the drop-down menu. You can select All Sensors, Temperature Sensors, Voltage Sensors, or Fan Sensors. The sensor readings are displayed. The Sensor Readings fields are described in TABLE 3-2.
TABLE 3-2 Sensor Readings Fields
Field    Description
Status   Reports the status of the sensor, including State Asserted, State Deasserted, Predictive Failure, Device Inserted, Device Present, Device Removed, Device Absent, Unknown, and Normal.
Name     Reports the name of the sensor. The names correspond to the following components:
         - sys: System or chassis
         - bp: Back panel
         - fp: Front panel
         - mb: Motherboard
         - io: I/O board
         - p0: Processor 0
         - p1: Processor 1
         - ft0: Fan tray 0
         - ft1: Fan tray 1
         - pdb: Power distribution board
         - ps0: Power supply 0
         - ps1: Power supply 1
Reading  Reports the rpm, temperature, and voltage measurements.
4. Click the
101. enabled operating systems will shut down to standby power mode immediately.
Emergency shutdown: Press and hold the Power button for four seconds to force main power off and enter standby power mode. After main power is off, the Power/OK LED on the front panel blinks once every three seconds, indicating that the server is in standby power mode.
Caution: You must disconnect the AC power cords from the back panel of the server to completely power off the server. When you use the Power button to enter standby power mode, power is still applied to the graphics redirect and service processor (GRASP) board and power supply fans, as indicated by the blinking Power/OK LED.
FIGURE 8-4 Sun Fire X4540 Server Front Panel
Figure Legend: 1 Power button; 2 Power/OK LED
2. Remove the component covers, including the hard disk drive cover, system controller cover, and fan cover, as required. FIGURE 8-5 shows the server internal components. For instructions on removing the component covers, refer to the Sun Fire X4540 Server Service Manual (819-4359).
FIGURE 8-5 Sun Fire X4540 Server Internal Components
3. Inspect the in
102. er lasts for about half an hour.
Note: Disconnecting the AC power removes the fault indication. To recover fault information, view the SP SEL. Refer to the Sun Integrated Lights Out Manager User's Guide.
- DIMM fault LED is off: The DIMM is operating properly.
- DIMM fault LED is flashing (amber): At least one of the DIMMs in this DIMM pair has reported 24 CEs within a 24-hour period.
- Motherboard Fault LED on mezzanine is on: There is a fault on the motherboard. This LED is provided because you cannot see the motherboard LEDs when the mezzanine board is present.
Note: The Motherboard Fault LED operates independently of the Press to See Fault button and does not operate on stored power.
See FIGURE 10-1 for the locations of DIMMs and LEDs on the motherboard.
FIGURE 10-1 DIMMs and LEDs on Motherboard
Figure Legend: DIMMs 0, 2, 1, 3; CPU 1 (under heatsink); CPU 0 (under heatsink); DIMMs 3, 1, 2, 0; DIMM fault LEDs; CPU 1 fault LED; Battery fault LED; CPU 0 fault LED; DIMM fault LED
Isolating and Correcting DIMM ECC Errors
If your log files report an ECC error or a problem with a DIMM, complete the steps below until you can isolate the fault. In this example, the log file reports an error with the DIMM in CPU0, slot 7. The fault LEDs on CPU0, slots 6 and 7 are on
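The flashing DIMM fault LED condition described above (24 CEs within a 24-hour period) is a sliding-window threshold. The Python sketch below illustrates that kind of counter; it is not the firmware's actual implementation, and the class name CorrectableErrorWindow is hypothetical. It assumes timestamps are given in seconds and arrive in non-decreasing order.

```python
from collections import deque

WINDOW_SECONDS = 24 * 60 * 60  # 24-hour period from the LED description
CE_THRESHOLD = 24              # 24 correctable errors from the LED description


class CorrectableErrorWindow:
    """Sliding-window CE counter for one DIMM pair (hypothetical helper)."""

    def __init__(self):
        self._timestamps = deque()

    def record(self, timestamp):
        """Record one CE at `timestamp` (seconds, non-decreasing).

        Returns True when the number of CEs inside the trailing
        24-hour window has reached the threshold.
        """
        self._timestamps.append(timestamp)
        # Drop errors that are 24 hours old or older relative to this one.
        while timestamp - self._timestamps[0] >= WINDOW_SECONDS:
            self._timestamps.popleft()
        return len(self._timestamps) >= CE_THRESHOLD


if __name__ == "__main__":
    window = CorrectableErrorWindow()
    for minute in range(24):
        flagged = window.record(minute * 60)
    print("threshold reached:", flagged)
```

Errors spread thinly over several days never trip the counter, because old entries fall out of the window before 24 accumulate.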
103. er on default values and clear passwords. Initialize status register A. Initializes data variables that are based on CMOS setup questions. Initializes both the 8259-compatible PICs in the system.
05  Initializes the interrupt controlling hardware (generally PIC) and interrupt vector table.
06  Do R/W test to CH-2 count reg. Initialize CH-0 as system timer. Install the POSTINT1Ch handler. Enable IRQ-0 in PIC for system timer interrupt. Traps INT1Ch vector to POSTINT1ChHandlerBlock.
C0  Early CPU Init Start: Disable Cache, Init Local APIC.
C1  Set up boot strap processor information.
C2  Set up boot strap processor for POST. This includes frequency calculation, loading BSP microcode, and applying user-requested value for GART Error Reporting setup question.
C3  Errata workarounds applied to the BSP (78 & 110).
C5  Enumerate and set up application processors. This includes microcode loading and workarounds for errata (78, 110, 106, 107, 69, 63).
C6  Re-enable cache for boot strap processor, and apply workarounds in the BSP for errata 106, 107, 69, and 63 if appropriate. In case of mixed CPU steppings, errors are sought and logged, and an appropriate frequency for all CPUs is found and applied. NOTE: APs are left in the CLI HLT state.
C7  The HT sets link frequencies and widths to their final values. This routine gets called after
104. 
5. Remove the DIMMs from the DIMM slots in the CPU. Refer to your server's service manual for details.
6. Visually inspect the DIMMs for physical damage, dust, or any other contamination on the connector or circuits.
7. Visually inspect the DIMM slot for physical damage. Look for cracked or broken plastic on the slot.
8. Dust off the DIMMs, clean the contacts, and reseat them.
Caution: Use only compressed air to dust DIMMs.
9. If there is no obvious damage, replace any failed DIMMs. For UCEs, if the LEDs indicate a fault with the pair, replace both DIMMs. Ensure that they are inserted correctly with ejector latches secured.
10. Reconnect AC power cords to the server.
11. Power on the server and run the diagnostics test again.
12. Review the log file. If the tests identify the same error, the problem is in the CPU, not the DIMMs.
CHAPTER 11
Using the ILOM Service Processor GUI to View System Information
This chapter contains information about using the Integrated Lights Out Manager (ILOM) service processor (SP) GUI to view monitoring and maintenance information for your server. This chapter includes the following sections:
- Connecting the SP to a Serial Port
- Viewing ILOM SP Event Logs
- Viewing Replaceable Co
105. es event records with sensor data records to produce descriptive event output. It takes longer to execute because it has to read from both the SEL and the Sensor Data Repository (SDR). For increased speed, generate an SDR cache before using the sel elist command. See Using the Sensor Data Repository (SDR) Cache on page 151. For example:
ipmitool -I lanplus -H IPADDR -U root -P changeme sel elist first 3
100 | Pre-Init Time-stamp | Temperature fp.t_amb | Upper Non-critical going high | Reading 31 > Threshold 30 degrees C
200 | Pre-Init Time-stamp | Power Supply ps1.pwrok | State Deasserted
300 | Pre-Init Time-stamp | Entity Presence ps1.prsnt | Device Present
Certain qualifiers are available to refine and limit the SEL output. To see only the first NUM records, add first NUM to the command. To see only the last NUM records, add last NUM. For example, to see the last three records in the SEL, type the following command:
ipmitool -I lanplus -H IPADDR -U root -P changeme sel elist last 3
800 | Pre-Init Time-stamp | Entity Presence ps1.prsnt | Device Absent
900 | Pre-Init Time-stamp | Phys Security sys.intsw | Gen Chassis intrusion
a00 | Pre-Init Time-stamp | Entity Presence ps0.prsnt | Device Present
If you want more detailed information on a particular event, you can use the sel get ID command, in which you specify an SEL record ID. For example:
ipmitool -I lanplus -H
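The sel elist variations above (first NUM, last NUM, and the optional SDR cache file) differ only in a few arguments. The following Python sketch assembles the argument list for each case. The function name sel_elist_args is hypothetical; the IPMItool options it emits (-I, -H, -U, -P, -S, and the first and last qualifiers) are the ones shown in this guide. The sketch only builds the command; it does not run it.

```python
def sel_elist_args(host, user, password, first=None, last=None, sdr_cache=None):
    """Build an ipmitool 'sel elist' argument list (hypothetical helper).

    first / last: optional record counts; at most one may be given.
    sdr_cache: optional path to a file created earlier with 'sdr dump'.
    """
    if first is not None and last is not None:
        raise ValueError("specify either first or last, not both")
    args = ["ipmitool", "-I", "lanplus", "-H", host, "-U", user, "-P", password]
    if sdr_cache:
        args += ["-S", sdr_cache]      # reuse a previously dumped SDR cache
    args += ["sel", "elist"]
    if first is not None:
        args += ["first", str(first)]
    elif last is not None:
        args += ["last", str(last)]
    return args


if __name__ == "__main__":
    # Placeholder host and the default credentials used in this guide.
    print(" ".join(sel_elist_args("IPADDR", "root", "changeme", last=3)))
```

The resulting list can be passed to subprocess.run on a system where IPMItool is installed and the SP is reachable.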
106. estigate the following conditions that can trigger an automatic shutdown sequence. A power-off sequence is initiated either by a request from the board management controller (BMC) or by a fault condition.
The conditions that trigger the BMC to issue a shutdown request are:
- An over-temperature condition for more than 1 second
- Multiple fan failures
The fault conditions that trigger a shutdown are:
- All power supplies have failed or have been removed
- A power supply has been out of spec for more than 100 ms
- The hot-swap circuit has faulted
- An over-temperature condition has occurred
Note: Any power supply that is out of spec causes a reset, but only power supplies that remain out of spec for more than 100 ms cause a shutdown.
Externally Inspecting the Server
To perform a visual inspection of the external system:
1. Inspect the external status indicator LEDs, which can indicate component malfunction. For the LED locations and descriptions of their behavior, see Front Panel Features on page 172.
2. Verify that nothing in the server environment is blocking air flow or making a contact that could short out power.
3. If the problem is not evident, continue with the next section, Internally Inspecting the Server on page 4.
Internally Inspecting the Server
To perform a visual inspection of the internal system:
1. Choose a method for shutting do
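The note above distinguishes a reset from a shutdown by how long a power supply stays out of spec. A minimal Python sketch of that rule follows. The function name power_supply_action and its return strings are hypothetical; the only fact taken from the guide is the 100 ms threshold.

```python
SHUTDOWN_THRESHOLD_MS = 100  # from the note: more than 100 ms causes a shutdown


def power_supply_action(out_of_spec_ms):
    """Classify the result of a power supply being out of spec.

    Hypothetical illustration of the rule stated in the note:
    any out-of-spec event causes a reset, and one lasting more
    than 100 ms causes a shutdown.
    """
    if out_of_spec_ms <= 0:
        return "none"
    if out_of_spec_ms > SHUTDOWN_THRESHOLD_MS:
        return "shutdown"
    return "reset"


if __name__ == "__main__":
    for duration in (0, 20, 100, 101):
        print(duration, "ms ->", power_supply_action(duration))
```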
107. f sel list. The sel elist command cross-references event records with sensor data records to produce descriptive event output. It takes longer to execute because it has to read from both the SEL and the Sensor Data Repository (SDR). For increased speed, generate an SDR cache before using the sel elist command. See Using the Sensor Data Repository (SDR) Cache on page 40. For example:
ipmitool -I lanplus -H <IPADDR> -U root -P changeme sel elist first 3
100 | Pre-Init Time-stamp | Temperature fp.t_amb | Upper Non-critical going high | Reading 31 > Threshold 30 degrees C
200 | Pre-Init Time-stamp | Power Supply ps1.pwrok | State Deasserted
300 | Pre-Init Time-stamp | Entity Presence ps1.prsnt | Device Present
Qualifiers allow you to refine and limit the SEL output. To see only the first NUM records, add first NUM to the command. To see only the last NUM records, add last NUM. For example, to see the last three records in the SEL, type the following command:
ipmitool -I lanplus -H IPADDR -U root -P changeme sel elist last 3
800 | Pre-Init Time-stamp | Entity Presence ps1.prsnt | Device Absent
900 | Pre-Init Time-stamp | Phys Security sys.intsw | Gen Chassis intrusion
a00 | Pre-Init Time-stamp | Entity Presence ps0.prsnt | Device Present
To view more detailed information on a particular event, you can use the sel get ID command, in which you specify an SEL record ID. For example:
ipmitool -I
108. fatal supply failure AC DC Fault LEDs are lit PS_VIN_GOOD or PS_PWR_OK signals are deasserted

Error: DC/DC power converter failure
Description: Any POWER_GOOD signal is deasserted from the DC/DC converters.
Handling: The Service Action Required LED is lit, the system is powered down to standby power mode, and the Power LED enters standby blink state.
Logged: SP SEL
Fatal?: Fatal

Error: Voltage above/below Threshold
Description: The SP monitors system voltages and detects voltage above or below a given threshold.
Handling: The Service Action Required LED and Power Supply Fault LED blink.
Logged: SP SEL
Fatal?: Fatal

Error: High temperature
Description: The SP monitors CPU and system temperatures and detects temperatures above a given threshold.
Handling: The Service Action Required LED and System Overheat Fault LED blink. The system controller is shut down above the specified critical level.
Logged: SP SEL
Fatal?: Fatal

Error: Processor thermal trip
Description: The CPU drives the THERMTRIP_L signal when it detects an overtemp condition.
Handling: CPLD shuts down power to the CPU. The Service Action Required LED and System Overheat Fault LED blink.
Logged: SP SEL
Fatal?: Fatal

Error: Boot device failure
Description: The BIOS is not able to boot from a device in the boot device list.
Handling: The BIOS goes to the next boot device in the list. If all devices in the list fail, an error message is displayed. Retry from beginning of list. SP can control or change boot order.
Logged: DMI Log
Fatal?: Non-fatal

Index
A
anonymous use
109. fer to the Sun Fire X4500 Server Service Manual (819-4359).
3. Inspect the internal status indicator LEDs, which can indicate component malfunction. For the LED locations and descriptions of their behavior, see Internal Status Indicator LEDs on page 175.
Note: You can hold down the Locate button on the server back panel or front panel for 5 seconds to initiate a push-to-test mode that illuminates all other LEDs, both inside and outside of the chassis, for 15 seconds.
4. Verify that there are no loose or improperly seated components.
5. Verify that all cable connectors inside the system are firmly and correctly attached to their appropriate connectors.
6. Verify that any after-factory components are qualified and supported. For a list of supported PCI cards and DIMMs, refer to the Sun Fire X4500 Server Service Manual (819-4359).
7. Check that the installed DIMMs comply with the supported DIMM population rules and configurations, as described in Troubleshooting DIMM Problems on page 6.
8. Replace the component covers.
9. To restore main power mode to the server (all components powered on), use a ballpoint pen or stylus to press and release the Power button on the server front panel. See FIGURE 1-2. When main power is applied to the full server, the Power/OK LED next to the Power button lights and remains lit.
10. If the problem with the server is not evident
110. g up from PM. Saving power-on CPUID in scratch CMOS.
00d2  Disable cache, full memory sizing, and verify that flat mode is enabled.
00d3  Memory detections and sizing in boot block, cache disabled, IO APIC enabled.
01d4  Test base 512KB memory. Adjust policies and cache first 8MB.
01d5  Bootblock code is copied from ROM to lower RAM. BIOS is now executing out of RAM.
01d6  Key sequence and OEM-specific method is checked to determine if BIOS recovery is forced. If next code is E0, BIOS recovery is being executed. Main BIOS checksum is tested.
01d7  Restoring CPUID; moving bootblock-runtime interface module to RAM; determine whether to execute serial flash.
01d8  Uncompressing runtime module into RAM. Storing CPUID information in memory.
01d9  Copying main BIOS into memory.
01da  Giving control to BIOS POST.
0004  Check CMOS diagnostic byte to determine if battery power is OK and CMOS checksum is OK. If the CMOS checksum is bad, update CMOS with power-on default values.
00c2  Set up boot strap processor for POST. This includes frequency calculation, loading BSP microcode, and applying user-requested value for GART Error Reporting setup question.
00c3  Errata workarounds applied to the BSP (78 & 110).
00c6  Re-enable cache for boot strap processor, and apply workarounds in the BSP for errata 106, 107, 69, and 63 if appropriate.
00c7  HT sets li
111. ge and fan sensors
compact  Compact sensor records   Digital Discrete: failure and presence sensors
event    Event-only records       Sensors used only for matching with SEL records
mcloc    MC locator records       Management Controller sensors
generic  Generic locator records  Generic devices: LEDs
fru      FRU locator records      FRU devices
For example, to see only the temperature, voltage, and fan sensors, you would use the following command with the full argument:
ipmitool -I lanplus -H IPADDR -U root -P changeme sdr elist full
fp.t_amb       | 0Ah | ok | 12.0 | 22 degrees C
ps.t_amb       | 11h | ok | 10.0 | 21 degrees C
ps0.f0.speed   | 15h | ok | 10.0 | 11000 RPM
ps1.f0.speed   | 19h | ok | 10.1 | 0 RPM
mb.t_amb       | 1Ah | ok | 7.0 | 25 degrees C
mb.v_bat       | 1Bh | ok | 7.0 | 3.18 Volts
mb.v_+3v3stby  | 1Ch | ok | 7.0 | 3.17 Volts
mb.v_+3v3      | 1Dh | ok | 7.0 | 3.34 Volts
mb.v_+5v       | 1Eh | ok | 7.0 | 5.04 Volts
mb.v_+12v      | 1Fh | ok | 7.0 | 12.22 Volts
mb.v_-12v      | 20h | ok | 7.0 | -12.20 Volts
mb.v_+2v5core  | 21h | ok | 7.0 | 2.54 Volts
mb.v_+1v8core  | 22h | ok | 7.0 | 1.83 Volts
mb.v_+1v2core  | 23h | ok | 7.0 | 1.21 Volts
io.t_amb       | 24h | ok | 15.0 | 21 degrees C
p0.t_core      | 2Bh | ok | 3.0 | 44 degrees C
p0.v_+1v5      | 2Ch | ok | 3.0 | 1.56 Volts
p0.v_+2v5core  | 2Dh | ok | 3.0 | 2.64 Volts
p0.v_+1v25core | 2Eh | ok | 3.0 | 1.32 Volts
p1.t_core      | 34h | ok | 3.1 | 40 degrees C
p1.v_+1v5      | 35h | ok | 3.1 | 1.55 Volts
p1.v_+2v5core  | 36h | ok | 3.1 | 2.6
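Output from sdr elist is convenient to post-process once each line is split into its five fields (sensor name, sensor number, status, entity ID, reading). The Python sketch below assumes the pipe-delimited layout that IPMItool normally prints; the function name parse_sdr_elist_line and the dictionary keys are hypothetical choices made for this illustration.

```python
def parse_sdr_elist_line(line):
    """Parse one 'sdr elist' output line into a dictionary.

    Assumes five pipe-separated fields:
    name | number (hex with trailing h) | status | entity | reading
    Hypothetical helper, not part of IPMItool.
    """
    fields = [field.strip() for field in line.split("|")]
    if len(fields) != 5:
        raise ValueError("expected 5 pipe-separated fields, got %d" % len(fields))
    name, number, status, entity, reading = fields
    return {
        "name": name,
        "number": int(number.rstrip("hH"), 16),  # '0Ah' -> 10
        "status": status,
        "entity": entity,
        "reading": reading,
    }


if __name__ == "__main__":
    sample = "fp.t_amb | 0Ah | ok | 12.0 | 22 degrees C"
    print(parse_sdr_elist_line(sample))
```

A name-to-number mapping built from these records can then be used to label events from commands that print only the sensor number.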
112. he SP.
- If you could not connect to the SP, there is likely a problem with the graphics redirect and service processor (GRASP) board. Replace this board, and then repeat Step 1 through Step 4. Refer to the Sun Fire X4500 Server Service Manual (819-4359) for instructions.
- If you successfully connected to the SP, continue with the following procedures:
  - Viewing ILOM SP Event Logs on page 21
  - Viewing Replaceable Component Information on page 24
  - Viewing Temperature, Voltage, and Fan Sensor Readings on page 26
Viewing ILOM SP Event Logs
Events are notifications that occur in response to some actions. The IPMI system event log (SEL) provides status information about the Sun Fire X4500 server's hardware and software to the ILOM software, which displays the events in the ILOM web GUI.
To view event logs:
1. Log in to the SP as Administrator or Operator to reach the ILOM web GUI:
a. Type the IP address of the server's SP into your web browser. The Sun Integrated Lights Out Manager Login screen is displayed.
b. Type your user name and password. When you first try to access the ILOM SP, you are prompted to type the default user name and password. The default user name and password are:
Default user name: root
Default password: changeme
2. From the System Monitoring tab, select Event Logs. The System Event Logs page is displayed. See FIGURE 3
113. her initial service visit information: see Gathering Service Visit Information on page 111.
Server powers on? If no, investigate power-on problems: see Troubleshooting Power Problems on page 111.
Perform external visual inspection: see External Inspection of the Server on page 112, or Identifying Status and Fault LEDs on page 171.
Perform internal visual inspection: see Internal Inspection of the Server on page 115.
Inspect DIMMs: see Troubleshooting DIMM Problems on page 123.
View BIOS event logs: see Viewing Event Logs on page 159.
View BIOS POST messages: see POST Codes on page 166.
View service processor logs and sensor information: see Using the ILOM Service Processor GUI to View System Information on page 49, or Using IPMItool to View System Information on page 61.
Run SunVTS diagnostics: see Using the Bootable Diagnostics CD on page 120.
Gathering Service Visit Information
Use the following general guidelines when you begin troubleshooting:
1. Collect initial service visit information from the service call paperwork or onsite personnel about the following items:
- Events that occurred prior to the failure
- Whether any hardware or software was modified
114. hese servers.
TABLE B-1 Hardware Error Handling Summary
Columns: Error; Description; Handling; Logged (DMI Log or SP SEL); Fatal?

Error: SP failure
Description: The SP fails to boot upon application of system power.
Handling: The SP controls the system reset, so the system may power on but will not come out of reset.
- During power up, the SP's boot loader turns on the power LED.
- During SP boot, Linux startup, and SP sanity check, the power LED blinks.
- The LED is turned off when SP management code (the IPMI stack) is started.
- At exit of BIOS POST, the LED goes to STEADY ON state.
Logged: Not logged
Fatal?: Fatal

Error: SP failure
Description: SP boots but fails POST.
Handling: The SP controls the system RESET, so the system will not come out of reset.
Logged: Not logged
Fatal?: Fatal

Error: BIOS POST failure
Description: Server BIOS does not pass POST.
Handling: There are fatal and non-fatal errors in POST. The BIOS does detect some errors that are announced during POST as POST codes on the bottom right corner of the display, on the serial console, and on the video display. Some POST codes are forwarded to the SP for logging. The POST codes do not come out in sequential order, and some are repeated, because some POST codes are issued by code in add-in card BIOS expansion ROMs. In the case of early POST failures (for example, the BSP fails to operate correctly), BIOS just halts without logging. For some other POST failures subsequent to memory and SP initialization, the BIOS logs a message to the SP's SEL.
115. host Power backplane IO controller board
APPENDIX B
Error Handling
This appendix contains information about how the servers process and log errors. It includes the following sections:
- Handling of Uncorrectable Errors
- Handling of Correctable Errors
- Handling of Parity Errors (PERR)
- Handling of System Errors (SERR)
- Handling Mismatching Processors
- Hardware Error Handling Summary
Handling of Uncorrectable Errors
This section explains how the server handles uncorrectable errors.
Note: The BIOS ChipKill feature must be disabled if you are testing for failures of multiple bits within a DRAM (ChipKill corrects for the failure of a four-bit-wide DRAM).
The BIOS logs the error to the SP system event log (SEL) through the board management controller (BMC). The SP's SEL is updated with the failing DIMM pair's specific bank address. The system reboots. The BIOS logs the error in DMI.
Note: If the error is on low 1MB, the BIOS freezes after rebooting. Therefore, no DMI log is recorded.
- An example of the error reported by the SEL through IPMI 2.0 is as follows:
- When low memory is erroneous, the BIOS is frozen on pre-boot low memory test because the BIOS cannot decompress itself into fa
116. 
Error: Single-bit DRAM ECC error
Description: With ECC enabled in the BIOS Setup, the CPU detects and corrects a single-bit error on the DIMM interface.
Handling: The CPU corrects the error in hardware. No interrupt or machine check is generated by the hardware. The polling is triggered every half second by SMI timer interrupts and is done by the BIOS SMI handler. The BIOS SMI handler starts logging each detected error and stops logging when the limit for the same error is reached. The BIOS's polling can be disabled through a software interface.
Logged: SP SEL
Fatal?: Normal operation

Error: Single four-bit DRAM error
Description: With CHIP-KILL enabled in the BIOS Setup, the CPU detects and corrects for the failure of a four-bit-wide DRAM on the DIMM interface.
Handling: The CPU corrects the error in hardware. No interrupt or machine check is generated by the hardware. The polling is triggered every half second by SMI timer interrupts and is done by the BIOS SMI handler. The BIOS SMI handler starts logging each detected error and stops logging when the limit for the same error is reached. The BIOS's polling can be disabled through a software interface.
Logged: SP SEL
Fatal?: Normal operation

Error: Uncorrectable DRAM ECC error
Description: The CPU detects an uncorrectable multiple-bit DIMM
Handling: The sync flood method of handling this is used to prevent the erroneous data from being propagated across the
Logged: SP SEL
Fatal?: Fatal
117. identical (same manufacturer, size, and speed).
Supported DIMM Configurations
TABLE 1-1 lists the supported DIMM configurations for the Sun Fire X4500 server.
TABLE 1-1 Supported DIMM Configurations
Slot 3  Slot 2  Slot 1  Slot 0  Total Memory Per CPU
0       2 GB    0       2 GB    4 GB
2 GB    2 GB    2 GB    2 GB    8 GB
Isolating and Correcting DIMM ECC Errors
If your log files report an ECC error or a problem with a DIMM, complete the steps below until you can isolate the fault. In this example, the log file reports an error with the DIMM in CPU0, slot 1. The fault LEDs on CPU0, slots 1 and 3 are lit.
To isolate and correct DIMM ECC errors:
1. If you have not already done so, shut down your server to standby power mode and remove the system controller cover. Refer to the Sun Fire X4500 Server Service Manual (819-4359).
2. Inspect the installed DIMMs to ensure that they comply with the DIMM Population Rules on page 11 and the Supported DIMM Configurations on page 11.
3. Inspect the fault LEDs on the DIMM slot ejectors and the CPU fault LEDs on the CPU board. See FIGURE 1-3. If any of these LEDs are lit, they can indicate the component with the fault.
4. Disconnect the AC power cords from the server.
Caution: Before handling components, attach an ESD wrist strap to a chassis ground (any unpainted metal surface). The system's printed circuit boards and hard disk drives contain co
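The two rows of TABLE 1-1 above can be encoded as a lookup so that a proposed per-CPU DIMM layout is checked before installation. The Python sketch below is illustrative only; the names SUPPORTED_CONFIGS and total_memory_gb are hypothetical, and the table data is limited to the two configurations listed in this guide.

```python
# Keys are DIMM sizes in GB for (slot 3, slot 2, slot 1, slot 0);
# values are the total memory per CPU in GB, as listed in TABLE 1-1.
SUPPORTED_CONFIGS = {
    (0, 2, 0, 2): 4,
    (2, 2, 2, 2): 8,
}


def total_memory_gb(slot3, slot2, slot1, slot0):
    """Return total memory per CPU for a supported layout.

    Raises ValueError if the layout is not one of the rows in
    TABLE 1-1. Hypothetical helper for illustration.
    """
    layout = (slot3, slot2, slot1, slot0)
    if layout not in SUPPORTED_CONFIGS:
        raise ValueError("unsupported DIMM configuration: %r" % (layout,))
    return SUPPORTED_CONFIGS[layout]


if __name__ == "__main__":
    print(total_memory_gb(0, 2, 0, 2))
    print(total_memory_gb(2, 2, 2, 2))
```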
118. ilure Asserted
In the sample output above, the sensor name is in the first column and the corresponding sensor number is in the second column. For a detailed explanation of each sensor, listed by name, refer to the Integrated Lights Out Manager Supplement.
Viewing Component Information With IPMItool
You can view information about system hardware components. The software refers to these components as field-replaceable unit (FRU) devices. To read the FRU inventory information on these servers, you must first have the FRU ROMs programmed. After programming is done, you can see a full list of the available FRU data by using the fru print command, as shown in the following example (only two FRU devices are shown in the example, but all devices would be shown):
ipmitool -I lanplus -H IPADDR -U root -P changeme fru print
FRU Device Description : Builtin FRU Device (ID 0)
 Board Mfg : BENCHMARK ELECTRONICS
 Board Product : ASSY SERV PROCESSOR X4X00
 Board Serial : OOGOHSV 0523000195
 Board Part Number : 501-6979-02
 Board Extra : 000-000-00
 Board Extra : HUNTSVILLE AL USA
 Board Extra : b302
 Board Extra : 06
 Board Extra : GRASP
 Product Manufacturer : SUN MICROSYSTEMS
 Product Name : ILOM
FRU Device Description : sp.net0.fru (ID 2)
 Product Manufacturer : MOTOROLA
 Product Name : FAST ETHERNET CONTROLLER
 Product Part Number : MPC8248 FCC
 Product Seri
119. indicate to system software that it refers to a group of LEDs rather than a single physical LED. TABLE 4-5 describes the LED sensor groups.
TABLE 4-5 LED Sensor Groups
Group Name      Sensors in Group
sys.power.led   bp.power.led, fp.power.led
sys.locate.led  bp.locate.led, fp.locate.led
sys.alert.led   bp.alert.led, fp.alert.led
For example, to set both the front and back panel Power/OK LEDs to a standby blink rate, you could type the following command:
ipmitool -I lanplus -H IPADDR -U root -P changeme sunoem led set sys.power.led standby
Set LED fp.power.led to STANDBY
Set LED bp.power.led to STANDBY
You could turn off the back panel Power/OK LED but leave the front panel Power/OK LED blinking by typing the following command:
ipmitool -I lanplus -H IPADDR -U root -P changeme sunoem led set bp.power.led off
Set LED bp.power.led to OFF
Using IPMItool Scripts For Testing
For testing purposes, it is often useful to change the status of all (or at least several) LEDs at once. You can do this by constructing an IPMItool script and executing it with the exec command. The following example shows a script to turn on all fan module LEDs:
sunoem led set ft0.fm0.led on
sunoem led set ft0.fm1.led on
sunoem led set ft0.fm2.led on
sunoem led set ft1.fm0.led on
sunoem led set ft1.fm1.led on
sunoem led set ft1.fm2.led on
If this script file were then named le
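A script like the fan module LED example above is repetitive enough to generate. The Python sketch below produces the six sunoem led set lines for two fan trays with three fan modules each. The function name fan_led_script is hypothetical, and the dotted LED names (for example, ft0.fm0.led) are an assumption about how the sensor names are punctuated. The sketch only builds the script text; it does not invoke IPMItool.

```python
def fan_led_script(state="on", trays=2, modules=3):
    """Return IPMItool script text that sets every fan module LED.

    One 'sunoem led set' line is emitted per fan module, in tray
    order. Hypothetical helper; LED names assume dotted notation.
    """
    lines = [
        "sunoem led set ft%d.fm%d.led %s" % (tray, module, state)
        for tray in range(trays)
        for module in range(modules)
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Print the script; redirect to a file to use it with 'ipmitool ... exec'.
    print(fan_led_script("on"), end="")
```

Generating the file keeps an "on" script and its matching "off" script in step, which is useful when the test has to restore the LEDs afterward.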
120. ing command:
man ipmitool
Connecting to the Server With IPMItool
To connect over a remote interface, you must supply a user name and password. The default user with administrator-level access is root, with password changeme. This means you must use the -U and -P parameters to pass both user name and password on the command line, as shown in the following example:
ipmitool -I lanplus -H <IPADDR> -U root -P changeme chassis status
Note: If you encounter command-syntax problems with your particular operating system, you can use the ipmitool -h command to determine which parameters can be passed with the ipmitool command on your operating system. Also refer to the IPMItool man page by typing man ipmitool.
Note: The example commands shown in this appendix use the default user name, root, and the default password, changeme. You should type the user name and password that have been set for the server.
Enabling the Anonymous User
To enable the Anonymous/NULL user, you must alter the privilege level on that account. This lets you connect without supplying a -U user option on the command line. The default password for this user is anonymous. To enable the anonymous user, type the following commands:
ipmitool -I lanplus -H IPADDR -U root -P changeme channel setaccess 1 1 privilege=4
ipmitool -I lanplus -H IPADDR -P anonym
121. instructions to access the service processor and redirect the console output so that the BIOS POST codes can be read.
1. Initialize the BIOS Setup utility by pressing the F2 key while the system is performing the power-on self-test (POST). The BIOS Main menu screen is displayed.
2. Select the Advanced menu tab. The Advanced Settings screen is displayed.
3. Select IPMI 2.0 Configuration. The IPMI 2.0 Configuration screen is displayed.
4. Select the LAN Configuration menu item. The LAN Configuration screen is displayed.
5. Determine the server's IP address.
a. Select the IP Assignment option that you want to use (DHCP or Static).
- If you choose DHCP, the server's IP address is retrieved from your network's DHCP server and displayed using the following format: Current IP address in BMC: xxx.xxx.xxx.xxx
- If you choose Static to assign the IP address manually, perform the following steps.
b. Type the IP address in the IP Address field. You can also enter the subnet mask and default gateway settings in their respective fields.
c. Select Commit and press Return to commit the changes.
d. Select Refresh and press Return to see your new settings displayed in the Current IP address in BMC field.
6. Start a web browser and type the service processor's IP address in the browser's URL field.
7. When you are prompted for a user name
122. itializes all the output devices.

Post Code  Description
31  Allocate memory for ADM module and uncompress it. Give control to ADM module for initialization. Initializes language and font modules for ADM. Activate ADM module.
33  Initializes the silent boot module. Sets the window for displaying text information.
37  Displaying sign-on message, CPU information, setup key message, and any OEM specific information.
38  Initializes different devices through DIM.
39  Initializes DMAC-1 and DMAC-2.
3A  Initialize RTC date/time.
3B  Test for total memory installed in the system. Also check for DEL or ESC keys to limit memory test. Display total memory in the system.

TABLE 5-2 POST Code Checkpoints (Continued)

Post Code  Description
3C  By this point, RAM read/write test is completed. Program memory holes or handle any adjustments needed in RAM size with respect to NB. Test if HT Module found an error in BootBlock and CPU compatibility for MP environment.
40  Detect different devices (parallel ports, serial ports, coprocessor in CPU, and so on) successfully installed in the system and update the BDA, EBDA, and so on.
50  Programming the memory hole or any kind of implementation that needs an adjustment in system RAM size, if needed.
52  Updates CMOS memory size from memory found in memory test. Allocates memory for Extended BIOS Data Area from base memory.
60  Initializes NUM-LOCK status and programs the KBD
123. ition type 14

The following command displays the Sun Fire X4500 hard drive's physical slot number, logical name, and status (present or absent).

TABLE 7-3 Command to Display Hard Drive Physical Slot Numbers

    hd -q

Here is an example of output listing the Sun Fire X4500 hard drive's physical slot number, logical name, and status.

FIGURE 7-5 hd Utility Output Listing Drive Slot Number and Status

Physical Slot Number   Logical Name   Status
 0   c5t0   present
 1   c5t4   present
 2   c4t0   present
 3   c4t4   present
 4   c7t0   present
 5   c7t4   present
 6   c6t0   present
 7   c6t4   present
 8   c1t0   present
 9   c1t4   present
10   c0t0   present
11   c0t4   present
12   c5t1   present
13   c5t5   present
14   c4t1   absent
15   c4t5   absent
16   c7t1   absent
17   c7t5   absent
18   c6t1   absent
19   c6t5   absent
20   c1t1   absent
21   c1t5   absent
22   c0t1   absent
23   c0t5   absent
24   c5t2   absent
25   c5t6   absent
26   c4t2   absent
27   c4t6   absent
28   c7t2   absent
29   c7t6   absent
30   c6t2   absent
31   c6t6   absent
32   c1t2   absent
33   c1t6   absent
34   c0t2   absent
35   c0t6   absent
36   c5t3   absent
37   c5t7   absent
38   c4t3   absent
39   c4t7   absent
40   c7t3   absent
41   c7t7   absent
42   c6t3   absent
43   c6t7   absent
44   c1t3   absent
45   c1t7   absent
46   c0t3   absent
47   c0t7   absent

The fol
124. ity and do any other NB, SB, PCIX, OEM specific programming needed during Late POST. Background scrubbing for DRAM and L1 and L2 caches is set up based on setup questions. Get the DRAM scrub limits from each node.

Post Code  Description
8D  Build ACPI tables (if ACPI is supported).
8E  Program the peripheral parameters. Enable/Disable NMI as selected.
90  Late POST initialization of system management interrupt.
A0  Check boot password if installed.
A1  Clean-up work needed before booting to OS.
A2  Takes care of runtime image preparation for different BIOS modules. Fill the free area in F000h segment with 0FFh. Initializes the Microsoft IRQ Routing Table. Prepares the runtime language module. Disables the system configuration display if needed.
A4  Initialize runtime language module.
A7  Displays the system configuration screen if enabled. Initialize the CPUs before boot, which includes the programming of the MTRRs.
A8  Prepare CPU for OS boot, including final MTRR values.
A9  Wait for user input at config display if needed.
AA  Uninstall POST INT1Ch vector and INT09h vector. Deinitializes the ADM module.
AB  Prepare BBS for Int 19 boot.
AC  Any kind of chipset (NB, SB) specific programming needed during End POST, just before giving control to runtime code booting to OS. Programs the system BIOS (0F0000h shadow RAM) cacheability. Ported to handle any OEM specific programming needed during End POST. Copy OEM specific data from POST_DSEG to RUN_CSEG.
B1  Save system context f
125. c. From the IPMI 2.0 Configuration screen, select View BMC System Event Log.
   The log takes about 60 seconds to generate, then it is displayed on the screen.

5. If the problem with the server is not evident, continue with "Using the ILOM Service Processor GUI to View System Information" on page 49 or "Using IPMItool to View System Information" on page 61.

Power-On Self-Test (POST)

The system BIOS provides a rudimentary power-on self-test. The basic devices required for the server to operate are checked, memory is tested, the Marvell 88SX6081 disk controller and attached disks are probed and enumerated, and the two Intel dual-gigabit Ethernet controllers are initialized.

The progress of the self-test is indicated by a series of POST codes. These codes are displayed at the bottom right corner of the system's VGA screen once the self-test has progressed far enough to initialize the system video. However, the codes are displayed as the self-test runs and scroll off of the screen too quickly to be read. An alternate method of displaying the POST codes is to redirect the output of the console to a serial port (see "Redirecting Console Output" on page 51).

How BIOS POST Memory Testing Works

The BIOS POST memory testing is performed as follows:

1. The first megabyte of DRAM is tested by the BIOS before the BIOS
126. l numbers, and manufacturing information. Select a device.

[Figure: Replaceable Component Information page. For the selected device it shows Chassis Information (Type, Part Number, Serial Number), Board Information (Manufacturer, Product Name, Serial Number, Part Number), and Product Information (Manufacturer Name, Product Name, Serial Number, Part Number).]

3. Select a component from the drop-down list.
   Information about the selected component is displayed.

4. If the problem with the server is not evident after viewing replaceable component information, continue with "Running SunVTS Diagnostic Tests" on page 120.

Viewing Temperature, Voltage, and Fan Sensor Readings

This section describes how to view the Sun Fire X4540 server temperature, voltage, and fan sensor readings.

A total of six temperature sensors are monitored. They all generate IPMI events that are logged in the system event log (SEL) when an upper threshold is exceeded. Three of these sensor readings are used to adjust the fan speeds and perform other actions, such as illuminating LEDs and powering off the chassis. These sensors and their respective thresholds are as follows:

- Front panel ambient temperature (fp.t_amb)
  Upper non-critical: 30 degrees C
  U
127. lanplus -H <IPADDR> -U root -P changeme sel get 0x0a00

    SEL Record ID          : 0a00
    Record Type            : 02
    Timestamp              : 07/06/1970 01:53:58
    Generator ID           : 0020
    EvM Revision           : 04
    Sensor Type            : Entity Presence
    Sensor Number          : 12
    Event Type             : Generic Discrete
    Event Direction        : Assertion Event
    Event Data (RAW)       : 01ffff
    Description            : Device Present

    Sensor ID              : ps0.prsnt (0x12)
    Entity ID              : 10.0
    Sensor Type (Discrete) : Entity Presence
    States Asserted        : Availability State
                             [Device Present]

In the example above, the event shows that Power Supply 0 is detected and present.

Clearing the SEL With IPMItool

To clear the SEL, type the sel clear command:

    ipmitool -I lanplus -H <IPADDR> -U root -P changeme sel clear
    Clearing SEL. Please allow a few seconds to erase.

Using the Sensor Data Repository (SDR) Cache

When working with the ILOM SP, certain operations can be expensive in terms of execution time and the amount of data transferred. Typically, issuing the sdr elist command requires the entire SDR to be read from the SP. Similarly, the sel elist command needs to read both the SDR and the SEL from the SP in order to cross-reference events and display useful information.

To speed up these operations, it is possible to pre-cache the static data in the SDR and feed it back into IPMItool. This can have a dramatic effect on the processing time for some commands. In order to generat
128. lation and become static thereafter.

- Channel number assignment skips empty channels (controllers).
  Note: Only a fully populated system with all 48 disks is supported.
- Channel numbers change when a USB storage device is present at the time of OS installation.
- Solaris channel numbering, with at least 1 HD per controller at install time:
  - All disks on Controller 0 have a c0 number.
  - All disks on Controller 1 have a c1 number.
  - Remote USB devices such as a CD-ROM or floppy (physically installed or present by mapping through JavaRConsole) are inserted after Controller 1 and have numbers c2, c3, and so on.
  - All disks on Controller 2 have a cx number that follows the last USB device number. For example, on a system with 2 USB devices, Controller 2 has a c4 name, and on a system with 3 USB devices, Controller 2 has a c5 name.
  - All disks on the remaining controllers have a cx number, where x is 1 more than the previous controller.

Pre-ILOM 2.0.2.5 and No USB Devices

TABLE 7-4 Sun Fire X4500 Disk Mapping, No USB Storage Devices, Pre-ILOM 2.0.2.5

Device columns: USB CD-ROM, USB Floppy, USB Device
Controller columns (left to right): Controller 3, Controller 2, Controller 5, Controller 4, Controller 1, Controller 0
Slots: 36 37 38 39 40 41 42 43 44 45 46 47
Names: c5t3 c5t7 c4t3 c4t7 c7t
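The channel numbering rules in this excerpt can be sketched in a few lines of Python. This is an illustration only, assuming six disk controllers (Controller 0 through Controller 5) and at least one disk per controller at install time. The function name controller_names is hypothetical and is not part of the hd utility.

```python
def controller_names(usb_devices, controllers=6):
    """Map each controller index to its Solaris cX name.

    Sketch of the rules in the text: Controllers 0 and 1 keep c0 and c1,
    USB storage devices take the next numbers (c2, c3, ...), and the
    remaining controllers follow in order after the last USB device.
    """
    names = {}
    for idx in range(controllers):
        if idx < 2:
            names[idx] = f"c{idx}"
        else:
            # Shift past the numbers consumed by the USB devices.
            names[idx] = f"c{idx + usb_devices}"
    return names


# Two USB devices (for example, remote CD-ROM and remote floppy).
print(controller_names(2))
```

With two USB devices, Controller 2 maps to c4, and with three it maps to c5, which agrees with the examples in the text.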
129. le: 40 degrees C

- CPU 0 (p0.t_core) and CPU 1 (p1.t_core) die temperatures
  Upper non-critical: 55 degrees C
  Upper critical: 65 degrees C
  Upper non-recoverable: 75 degrees C

There are three other temperature sensors:

- I/O board ambient temperature (io.t_amb)
- System controller ambient temperature (mb.t_amb)
- Power distribution board ambient temperature (pdb.t_amb)

To View Sensor Readings

1. Log in to the SP as Administrator or Operator to reach the ILOM web GUI:
   a. Type the IP address of the server's SP into your web browser.
      The Sun Integrated Lights Out Manager Login screen is displayed.
   b. Type your user name and password.
      When you first try to access the ILOM Service Processor, you are prompted to type the default user name and password. The default user name and password are:
      Default user name: root
      Default password: changeme

2. From the System Monitoring tab, select Sensor Readings.
   The Sensor Readings page is displayed. See FIGURE 3-3.

FIGURE 3-3 Sensor Readings Page

[Figure: Sun Integrated Lights Out Manager, System Monitoring tab, Sensor Readings page. The page lets you view readings for temperature, voltage, or fan sensors and select a sensor type cat]
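The CPU die temperature thresholds listed in this excerpt lend themselves to a simple lookup. The following Python sketch classifies a reading against those three values. It assumes a threshold counts as crossed only when the reading is strictly greater than it; the function name classify_cpu_temp is hypothetical and is not an ILOM or IPMItool interface.

```python
# Thresholds for p0.t_core and p1.t_core, in degrees C, highest first.
CPU_DIE_THRESHOLDS_C = [
    (75, "upper non-recoverable"),
    (65, "upper critical"),
    (55, "upper non-critical"),
]


def classify_cpu_temp(reading_c):
    """Return the most severe threshold that the reading exceeds, or 'ok'."""
    for limit, label in CPU_DIE_THRESHOLDS_C:
        if reading_c > limit:  # assumption: strictly greater than the limit
            return label
    return "ok"


for sample in (48, 58, 70, 80):
    print(sample, classify_cpu_temp(sample))
```

Checking the highest threshold first means a reading that exceeds several limits is reported at its most severe level.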
130. lf-test (POST) messages and BIOS event logs during system startup. Continue with "Viewing Event Logs" on page 159.

CHAPTER 9

Using SunVTS Diagnostic Software

This chapter contains information about the SunVTS diagnostic software tool. This chapter includes the following topics:

- About SunVTS Diagnostic Software
- Running SunVTS Diagnostic Tests

About SunVTS Diagnostic Software

Sun Fire X4540 servers are shipped with a Bootable Diagnostics CD that contains SunVTS (Validation Test Suite) software.

SunVTS provides a comprehensive diagnostic tool that tests and validates Sun hardware by verifying the connectivity and functionality of most hardware controllers and devices on Sun platforms. SunVTS software can be tailored with modifiable test instances and processor affinity features.

The following tests are supported on x86 platforms. The current x86 support is for the 32-bit operating system only.

- CD DVD Test (cddvdtest)
- CPU Test (cputest)
- Disk and Diskette Drives Test (disktest)
- Data Translation Look-Aside Buffer (dtlbtest)
- Floating Point Unit Test (fputest)
- Network Hardware Test (nettest)
- Ethernet Loopback Test (netlbtest)
- Physical Memory Test (pmemtest)
- Serial Port Test (serialtest)
- System Test (systest)
- Universal Serial Bus Test (usbtest)
- Virtual Memory Test (vmemtest)
131. llowed required

CPU Board LEDs

The CPU board has three types of LEDs: DIMM fault, CPU fault, and battery fault. The CPU LEDs are active only when the Remind button is pressed. CPU LEDs blink to indicate a failure; otherwise they stay off.

Note: The CPU and DIMM LEDs continue to indicate a failure until the system is powered on. The Battery LED continues to indicate a failure until the service processor is started.

Internal LEDs appear in FIGURE 14-6 and are listed in TABLE 14-3.

FIGURE 14-6 CPU Module LED and Button Locations

Figure Legend
1  DIMMs 0 2 1 3
2  CPU 1 (under heatsink)
3  CPU 0 (under heatsink)
4  DIMMs 3 1 2 0
5  DIMM fault LEDs
6  CPU 1 fault LED
7  Battery fault LED
8  CPU 1 fault LED
9  DIMM fault LED

TABLE 14-3 Internal LEDs

Name               Color   Function
1. Disk Drives (see FIGURE 14-5)
   Status          Green   Blinking: data is transferring; unit is OK.
   Fault           Amber   Fault; service action is required.
   Ready to Remove Blue    Unit is ready to remove. Service action allowed.
2. Fan Trays (see FIGURE 14-5)
   Status          Green   Unit is OK.
   Fault           Amber   Fault; service action is required.
3. CPU (see FIGURE 14-6). LEDs are active only when the Remind button is pressed.
   DIMM Failure    Amber   Blinks to indicate that the system has found a fault with the DIMM. Restar
132. lowing command displays the Sun Fire X4500 hard drive controller number and the corresponding PCI device nodes. This is useful in determining the HBA controller number based on the PCI device node from syslog messages.

    hd -j
    /devices/pci@0,0/pci1022,7458@1/pci11ab,11ab@1    c0
    /devices/pci@0,0/pci1022,7458@2/pci11ab,11ab@1    c1
    /devices/pci@1,0/pci1022,7458@3/pci11ab,11ab@1    c4
    /devices/pci@1,0/pci1022,7458@4/pci11ab,11ab@1    c5
    /devices/pci@2,0/pci1022,7458@7/pci11ab,11ab@1    c6
    /devices/pci@2,0/pci1022,7458@8/pci11ab,11ab@1    c7

Sun Fire X4500 Disk Mapping

When you reinstall the OS, the hard disk names change depending on the ILOM version and the USB/CD storage devices present at the time of OS installation.

Pre-ILOM 2.0.2.5 and ILOM 2.0.2.5 and Later

The following applies to the hard drive and device mapping for systems with pre-ILOM 2.0.2.5 and with ILOM 2.0.2.5 and later installed:

- Pre-ILOM 2.0.2.5: Remote CD-ROM is always mapped to c2, and remote floppy device is always mapped to c3.
- ILOM 2.0.2.5 or later: Remote CD-ROM and remote floppy devices are only mapped if enabled in the JavaRConsole. If one device is enabled, it has the c2 number. If both are enabled, numbers c2 and c3 are both used.
- Disk device path: /dev/cxtydz, where c is the controller number, t is the target number, and d is the disk number.
- Channel numbers are assigned dynamically at the time of OS instal
133. m Configuration Display   [Disabled]
     Quiet Boot                [Disabled]
     Language                  [English]
     AddOn ROM Display Mode    [Force BIOS]
     Bootup Num-Lock           [On]
     Wait For F1 If Error      [Disabled]
     Interrupt 19 Capture      [Disabled]

     Help text: decrease the time needed to boot the system.
     Keys: Select Screen, Select Item, Change Option, F1 General Help, F10 Save and Exit, ESC Exit

4. On the Boot Settings Configuration screen, select the options that you want to enable or disable:

- Quick Boot: This option is disabled by default. If you enable it, the BIOS skips certain tests while booting, such as the extensive memory test. This decreases the time it takes for the system to boot.
- System Configuration Display: This option is disabled by default. If you enable it, the System Configuration screen is displayed before booting begins.
- Quiet Boot: This option is disabled by default. If you enable it, the Sun Microsystems logo is displayed instead of POST codes.
- Language: This option is reserved for future use. Do not change.
- Add On ROM Display Mode: This option is set to Force BIOS by default. It has effect only if you have also enabled the Quiet Boot option, and it controls whether output from the Option ROM is displayed. The two settings for this option ar
134. mponent Information
- Viewing Temperature, Voltage, and Fan Sensor Readings

For more information on using the ILOM SP GUI to maintain the server (for example, configuring alerts), refer to the Sun Integrated Lights Out Manager User's Guide and supplement.

Connecting the SP to a Serial Port

To make a serial connection to the service processor:

1. Connect a serial cable from the RJ-45 Serial Management port on the server back panel to a terminal device.

2. Press ENTER on the terminal device to establish a connection between the terminal device and the server ILOM SP.

   Note: If you are connecting to the serial port on the SP before it has been powered up or during its power-up sequence, you will see bootup messages displayed.

   The service processor displays a login prompt after a short wait. For example:

       SUNSP0003BA84D777 login:

   The first string in the prompt is the default host name for the ILOM SP. The host name consists of the prefix SUNSP and the unique MAC address of the ILOM SP.

3. Log in to the SP.
   When you first try to access the ILOM SP, you are prompted to type the default user name and password. Type the default user name and password:
   Default user name: root
   Default password: changeme

   After you have successfully logged in to the SP, the screen displays the default command prompt:

       ->

4. To start the serial console, type the following commands:

       cd /SP/console
       star
135. mponents that are extremely sensitive to static electricity.

10. Replace the CPU that has the problem. Refer to the Sun Fire X4500 Server Service Manual, 819-4359.
11. Remove the DIMMs from the CPU board. Refer to the Sun Fire X4500 Server Service Manual, 819-4359.
12. Visually inspect the DIMMs for physical damage, dust, or any other contamination on the connector or circuits.
13. Visually inspect the DIMM slot for physical damage. Look for cracked or broken plastic on the slot.
14. Dust off the DIMMs, clean the contacts, and reseat them.
15. If there is no obvious damage, exchange the individual DIMMs between the two slots of a given pair. Ensure that they are inserted correctly with ejector latches secured. Using the slot numbers from the example:
    a. Remove the DIMMs from CPU0, slots 1 and 3.
    b. Reinstall the DIMM from slot 1 into slot 3.
    c. Reinstall the DIMM from slot 3 into slot 1.
16. Reconnect AC power cords to the server.
17. Power on the server and run the diagnostics test again.
18. Review the log file.
    - If the error now appears in CPU0, slot 3 (opposite to the original error in slot 1), the problem is related to the individual DIMM. In this case, return both DIMMs (the pair) to the Support Center for replacement.
    - If the error still appears in CPU0, slot 1 (as the original error did), the problem is not related
136. n troubleshooting.

To gather service visit information:

1. Collect information about the following items:
   - Events that occurred prior to the failure
   - Whether any hardware or software was modified or installed
   - Whether the server was recently installed or moved
   - How long the server exhibited symptoms
   - The duration or frequency of the problem
2. Document the server settings before you make any changes.
   If possible, make one change at a time in order to isolate potential problems. In this way, you can maintain a controlled environment and reduce the scope of troubleshooting.
3. Note the results of any change that you make. Include any errors or informational messages.
4. Check for potential device conflicts before you add a new device.
5. Check for version dependencies, especially with third-party software.

Troubleshooting Power Problems

If the server will not power on:

1. Check that AC power cords are attached firmly to the server's power supplies and to the AC sources.
   Using the cable clamps helps keep the AC power cords attached to the server's power supplies.
2. Check that the component covers are firmly in place, including the hard disk drive access cover, system controller cover, and fan access cover.
   An intrusion switch on the system controller shuts the server down when the hard disk drive access cover is removed. Inv
137. n   [Disabled]

     Keys: Select Screen, Select Item, Enter Go to Sub Screen, F1 General Help, F10 Save and Exit, ESC Exit

   c. From the IPMI 2.0 Configuration screen, select View BMC System Event Log.
      The log takes about 60 seconds to generate, then it is displayed on the screen.

5. If the problem with the server is not evident, continue with "Using the ILOM Service Processor GUI to View System Information" on page 49 or "Using IPMItool to View System Information" on page 61.

About Power-On Self-Test (POST)

The system BIOS provides a rudimentary power-on self-test. After power on, POST does the following tasks:

- Checks the basic devices required for the server to operate
- Tests memory and the LSI SAS1068E disk controllers
- Probes and enumerates the attached disks
- Initializes the two Intel dual-gigabit Ethernet controllers

The progress of the self-test is indicated by a series of POST codes. These codes are displayed at the bottom right corner of the system's VGA screen after the self-test has progressed far enough to initialize the system video. However, the codes are displayed as the self-test runs and scroll off of the screen too quickly to be read (see "POST Codes" on page 166).
138. n, Previous System Error Listed

    American Megatrends (www.ami.com), Sun Microsystems
    BMC Firmware Revision 1.00
    Checking NVRAM
    Initializing USB Controllers .. Done
    Press F2 to run Setup (CTRL+E on Remote Keyboard)
    Press F12 to boot from the network (CTRL+N on Remote Keyboard)
    USB Device(s): 3 Keyboards, 3 Mice, 2 Storage Devices
    Auto-Detecting Pri Master .. ATAPI CDROM
    Pri Master: DU 28SL 1 0A, Ultra DMA Mode-2
    Auto-detecting USB Mass Storage Devices
    Device 01: AMI Virtual CDROM
    Device 02: AMI Virtual Floppy
    2 USB mass storage devices found and configured
    0085 BMC Responding
    A Hyper Transport sync flood error occurred on last boot

PCI System Error (SERR) and HyperTransport Sync Flood Error are logged in DMI and the SP SEL. See the following sample output:

    SEL Record ID    : 0a00
    Record Type      : 00
    Timestamp        : 08/10/2005 06:05:32
    Generator ID     : 0001
    EvM Revision     : 04
    Sensor Type      : Critical Interrupt
    Sensor Number    : 00
    Event Type       : Sensor-specific Discrete
    Event Direction  : Assertion Event
    Event Data       : 05ffff
    Description      : PCI SERR

- FIGURE B-6 shows an example DMI log screen from the BIOS Setup page with a system error.

FIGURE B-6 DMI Log Screen, System Error Listed

[Figure: BIOS SETUP UTILITY, View Event Log screen, listing a HyperTransport event and a system error entry dated 9/12/05.]
139. nd fan cover removed.

This appendix includes the following:

- External Status Indicator LEDs
- Exterior Features, Controls, and Indicators
- Internal Status Indicator LEDs

External Status Indicator LEDs

See the following figures and tables for information about the LEDs that are viewable on the outside of the server:

- FIGURE 6-1 describes the front panel.
- FIGURE 6-2 and TABLE 6-1 describe the front panel controls and indicators.
- TABLE 6-2 describes the rear panel.
- FIGURE 6-6 describes the LED and button locations.

Exterior Features, Controls, and Indicators

This section shows and describes the features and the controls and indicators on the front and rear panels of the Sun Fire X4500 server.

Front Panel

FIGURE 6-1 shows the front panel. FIGURE 6-2 shows a close-up of the controls and indicators. TABLE 6-1 lists and describes the controls and indicators.

FIGURE 6-1 Sun Fire X4500 Server Front Panel LEDs

FIGURE 6-2 Sun Fire X4500 Server Front Panel Controls and Indicators

TABLE 6-1 Front Panel Controls and Indicators

Name                    Color   Description
1. Locate button/LED    White   Operators can turn this LED on remotely to help them locate the server in a crowded server room. Press to turn off. Pressing the Loc
140. nk frequencies and widths to their final values.

Post Code  Description
000a  Initializing the 8042 compatible Keyboard Controller.
000c  Detecting the presence of Keyboard in KBC port.
000e  Testing and initialization of different Input Devices. Traps the INT09h vector, so that the POST INT09h handler gets control for IRQ1.
8600  Preparing CPU for booting to OS by copying all of the context of the BSP to all application processors present. NOTE: APs are left in the CLI/HLT state.
de00  Preparing CPU for booting to OS by copying all of the context of the BSP to all application processors present. NOTE: APs are left in the CLI/HLT state.
8613  Initialize PM regs and PM PCI regs at Early POST. Initialize multi-host bridge, if system supports it. Set up ECC options before memory clearing. Enable PCI-X clock lines in the 8131.
0024  Uncompress and initialize any platform-specific BIOS modules.
862a  BBS ROM initialization.
002a  Generic Device Initialization Manager (DIM): Disable all devices.
042a  ISA PnP devices: Disable all devices.
052a  PCI devices: Disable all devices.
122a  ISA devices: Static device initialization.
152a  PCI devices: Static device initialization.
252a  PCI devices: Output device initialization.
202c  Initializing different devices. Detecting and initializing the video adapter installed in the system that has optional ROMs.
002e  Initializing all the output devices.
0033  Initializing the silent boot module. Set the window for displaying text informa
141. nstructions are optional, but you can use them to change the operations that the server performs during POST testing.

To Change POST Options

1. Initialize the BIOS Setup utility by pressing the F2 key while the system is performing the power-on self-test (POST).
   The BIOS Main menu screen is displayed.

2. Select Boot.
   The Boot Settings screen is displayed.

   Menu tabs: Advanced, PCIPnP, Boot, Security, Chipset, Exit
   Boot Settings:
     Boot Settings Configuration
     Boot Device Priority
     Hard Disk Drives
     Removable Drives
     CD/DVD Drives
   Help text: Configure Settings during System Boot.
   Keys: Select Screen, Select Item, Enter Go to Sub Screen, F1 General Help, F10 Save and Exit, ESC Exit

3. Select Boot Settings Configuration.
   The Boot Settings Configuration screen is displayed.

   Help text: Allows BIOS to skip certain tests while booting. This will
   Boot Settings Configuration:
     Quick Boot   [Disabled]
     Syste
142. o boot upon application of system power
   Handling: The SP controls the system reset, so the system may power on but will not come out of reset.
     - During power-up, the SP's boot loader turns on the power LED.
     - During SP boot, Linux startup, and SP sanity check, the power LED blinks.
     - The LED is turned off when SP management code (the IPMI stack) is started.
     - At exit of BIOS POST, the LED goes to STEADY ON state.
   Logged: Not logged
   Fatal: Fatal

Error: SP failure
   Description: SP boots but fails POST.
   Handling: The SP controls the system RESET, so the system will not come out of reset.
   Logged: Not logged
   Fatal: Fatal

Error: BIOS POST failure
   Description: Server BIOS does not pass POST.
   Handling: There are fatal and non-fatal errors in POST. The BIOS does detect some errors that are announced during POST as POST codes on the bottom right corner of the display, on the serial console, and on the video display. Some POST codes are forwarded to the SP for logging. The POST codes do not come out in sequential order and some are repeated, because some POST codes are issued by code in add-in card BIOS expansion ROMs. In the case of early POST failures (for example, the BSP fails to operate correctly), BIOS just halts without logging. For some other POST failures, subsequent to memory and SP initialization, the BIOS logs a message to the SP's SEL.

TABLE D-1 Hardware Error Handling Summary (Continued)

Columns: Error, Description, Handling, Logged (DMI Log or SP SEL), Fatal
143. ock will be in UTC. Via the CLI, ILOM web GUI, and IPMI

Viewing Replaceable Component Information

Depending on the component you select, information about the manufacturer, component name, serial number, and part number can be displayed. To view replaceable component information:

1. Log in to the SP as Administrator or Operator to reach the ILOM web GUI:
   a. Type the IP address of the server's SP into your web browser.
      The Sun Integrated Lights Out Manager Login screen is displayed.
   b. Type your user name and password.
      When you first try to access the ILOM Service Processor, you are prompted to type the default user name and password. The default user name and password are:
      Default user name: root
      Default password: changeme

2. From the System Information tab, select Components.
   The Replaceable Component Information page is displayed. See FIGURE 3-2.

FIGURE 3-2 Replaceable Component Information Page

[Figure: Sun Integrated Lights Out Manager, System Information tab, Components page. The page lets you view component part numbers, serial numbers, and manufacturing information, and select a device.]
144. ommand: man ipmitool

Connecting to the Server With IPMItool

To connect over a remote interface, you must supply a user name and password. The default user with administrator-level access is root, with password changeme. You must use the -U and -P parameters to pass both user name and password on the command line, as shown in the following example:

    ipmitool -I lanplus -H <IPADDR> -U root -P changeme chassis status

Note: If you experience command syntax problems with your particular operating system, you can use the ipmitool -h command and parameter to determine which parameters can be passed with the ipmitool command on your operating system. Also refer to the IPMItool man page by typing man ipmitool.

Note: The example commands in this appendix show the default user name, root, and the default password, changeme. Type the user name and password that have been set for the server.

Enabling the Anonymous User

To enable the Anonymous (NULL) user, you must alter the privilege level on that account. Altering the privilege level lets you connect without supplying a -U user option on the command line. The default password for this user is anonymous. To enable the anonymous user, type the following commands:

    ipmitool -I lanplus -H <IPADDR> -U root -P changeme channel setaccess 1 1 privilege=4
    ipmitool -I lanplus -H <IPADDR>
145. onfigurations

TABLE 10-1 lists the supported DIMM configurations for the Sun Fire X4500/X4540 servers.

TABLE 10-1 Supported DIMM Configurations

Slot 3   Slot 2   Slot 1   Slot 0   Total Memory Per CPU
0        2 GB     0        2 GB     4 GB
2 GB     2 GB     2 GB     2 GB     8 GB
4 GB     4 GB     4 GB     4 GB     16 GB

DIMM Replacement Policy

Replace a DIMM when one of the following events takes place:

- The DIMM fails memory testing under BIOS due to Uncorrectable Memory Errors (UCEs).
- UCEs occur and investigation shows that the errors originated from memory.

In addition, a DIMM should be replaced whenever more than 24 Correctable Errors (CEs) originate in 24 hours from a single DIMM and no other DIMM is showing further CEs.

- If more than one DIMM has experienced multiple CEs, other possible causes of CEs have to be ruled out by a qualified Sun Support specialist before replacing any DIMMs.

Retain copies of the logs showing the memory errors per the above rules to send to Sun for verification prior to calling Sun.

How DIMM Errors Are Handled by the System

This section describes system behavior for the two types of DIMM errors: UCEs (Uncorrectable Errors) and CEs (Correctable Errors). This section also describes BIOS DIMM error messages.

Uncorrectable DIMM Errors

In all operating systems (OSs), the behavior is the same for UCEs:

1. When a UCE occurs, the memory controller
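The correctable-error rule in the DIMM Replacement Policy can be expressed as a small check over logged CE timestamps. The Python sketch below is an illustration under stated assumptions, not a Sun tool: the input format (a dictionary from a DIMM label to a list of timestamps) and the function names are hypothetical, and the sketch simplifies the policy by deferring to Sun Support whenever more than one DIMM shows any CEs.

```python
from datetime import datetime, timedelta

CE_LIMIT = 24                 # replace when MORE than 24 CEs ...
WINDOW = timedelta(hours=24)  # ... originate within 24 hours


def max_ces_in_window(timestamps):
    """Return the largest number of CE events inside any 24-hour window."""
    ts = sorted(timestamps)
    best = 0
    start = 0
    for end in range(len(ts)):
        # Slide the window start forward until it spans at most 24 hours.
        while ts[end] - ts[start] > WINDOW:
            start += 1
        best = max(best, end - start + 1)
    return best


def dimms_to_replace(ce_log):
    """ce_log maps a DIMM label to its CE timestamps (hypothetical format)."""
    with_ces = [dimm for dimm, ts in ce_log.items() if ts]
    if len(with_ces) != 1:
        # No CEs at all, or several DIMMs with CEs: other causes must be
        # ruled out by Sun Support before any DIMM is replaced.
        return []
    dimm = with_ces[0]
    return [dimm] if max_ces_in_window(ce_log[dimm]) > CE_LIMIT else []


base = datetime(2009, 7, 1)
log = {
    "CPU0 DIMM1": [base + timedelta(minutes=30 * i) for i in range(25)],
    "CPU0 DIMM0": [],
}
print(dimms_to_replace(log))
```

The sliding window keeps the check linear in the number of sorted events, which matters little here but avoids rescanning a long SEL for every event.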
146. or ACPI
00  Prepares CPU for booting to OS by copying all of the context of the BSP to all application processors present. NOTE: APs are left in the CLIHLT state.
61-70  OEM POST Error. This range is reserved for chipset vendors and system manufacturers. The error associated with this value may be different from one platform to the next.

CHAPTER 14 Identifying Status and Fault LEDs
This chapter contains information about the external and internal LEDs on the Sun Fire X4500/X4540 servers. This chapter includes the following topics:
- Front Panel Features on page 172
- Rear Panel Features on page 174
- Internal Status Indicator LEDs on page 175
Sections describe the controls and indicators on the front and rear panels of the Sun Fire X4540 server. These sections describe external status LEDs that you can see from the outside of the server. Additional sections describe internal status and fault LEDs that can be viewed only with the hard disk drive cover, system controller cover, and fan cover removed.
The following figures and tables describe the features and status indicator LEDs that are visible outside of the server:
- FIGURE 14-1 Sun Fire X4540 Server Front Panel Features on page 172
- FIGURE 14-2 Sun Fire X4540 Server Front Panel Controls and Indicators on page 173 and TABLE 14-1 Front Panel Features on
147. or LEDs
The CPU LEDs are active only when the Remind button is pressed. They blink to indicate a failure; otherwise they stay off.
Note: The CPU and DIMM LEDs continue to indicate a failure until the system is powered on. The Battery LED continues to indicate a failure until the service processor is started.
FIGURE 6-6 CPU Module LED and Button Locations
[Figure not reproduced. Callouts: CPU 1 (under heatsink), CPU 0 (under heatsink), DIMM slots (labeled 0, 2, 1, 3 and 3, 1, 2, 0), DIMM fault LEDs, CPU 1 fault LED, battery, battery fault LED, CPU 0 fault LED, Press to see fault button.]

CHAPTER 7 hd Utility
This chapter includes information about the following topics:
- Overview of the hd Utility on page 71
- Using the hd Utility on page 73
- hd Command Options and Parameters on page 73
- Sun Fire X4500 Disk Mapping on page 81
Overview of the hd Utility
The Sun Fir
148. or installed
- Whether the server was recently installed or moved
- How long the server has exhibited symptoms
- The duration or frequency of the problem
Document the existing server settings before you make any changes. Record the BIOS version, software version, and server serial numbers. Check the product notes to view issues associated with the server hardware and software.
Adjust the existing server settings to correct the problem. If possible, make one change at a time in order to isolate potential problems. Use this method to maintain a controlled environment and reduce troubleshooting. Note the changes made and the results of any change you make. Include any errors or informational messages.
Check for potential device conflicts before you add a new device. Check for version dependencies, especially with third-party software.
If the problem is not evident, continue with the next section, Troubleshooting Power Problems on page 111.

Troubleshooting Power Problems
Do one of the following:
- If the server can power on, skip this section and proceed to External Inspection of the Server on page 112.
- If the server cannot power on, do the following procedure:
1. Check that AC power cords are attached firmly to the server's power supplies and to the AC sources. Use of the cable clamps will ensure that the AC power cords are attached to the server
149. ostic tool that tests and validates Sun hardware by verifying the connectivity and functionality of most hardware controllers and devices on Sun platforms SunVTS software can be tailored with modifiable test instances and processor affinity features The following tests are supported on x86 platforms The current x86 support is for the 32 bit operating system only m CD DVD Test cddvdtest m CPU Test cputest m Disk and Diskette Drives Test disktest m Data Translation Look Aside Buffer dtlbtest m Floating Point Unit Test fputest m Network Hardware Test nettest m Ethernet Loopback Test netlbtest a Physical Memory Test pmemtest m Serial Port Test serialtest m System Test systest m Universal Serial Bus Test usbtest m Virtual Memory Test vmemtest Sun VTS software has a sophisticated graphical user interface GUI that provides test configuration and status monitoring The user interface can be run on one system to display the SunVTS testing of another system on the network SunVTS software also provides a TTY mode interface for situations in which running a GUI is not possible SunVTS Documentation For the most up to date SunVTS documentation go to http docs sun com app docs coll 1140 2 16 Diagnosing Server Problems With the Bootable Diagnostics CD Sun VTS 6 2 or later software is preinstalled on these Sun Fire X4500 servers The server is also shipped with the Sun Fire X4500 Server Boot
150. ous user list
Changing the Default Password
You can also change the default password for a particular user ID. First, get a list of users and find the ID for the user you wish to change. Then supply it with a new password, as shown in the following command sequence:
  ipmitool -I lanplus -H <IPADDR> -U root -P changeme user list
  ID  Name  Callin  Link Auth  IPMI Msg  Channel Priv Limit
  1         false   false      true      NO ACCESS
  2   root  false   false      true      ADMINISTRATOR
  ipmitool -I lanplus -H <IPADDR> -U root -P changeme user set password 2 newpass
  ipmitool -I lanplus -H <IPADDR> -U root -P newpass chassis status
Configuring an SSH Key
You can use IPMItool to configure an SSH key for a remote shell user. To do this, first determine the user ID for the desired remote SP user with the user list command:
  ipmitool -I lanplus -H <IPADDR> -U root -P changeme user list
Then supply the user ID and the location of the RSA or DSA public key to use with the ipmitool sunoem sshkey command. For example:
  ipmitool -I lanplus -H <IPADDR> -U root -P changeme sunoem sshkey set 2 id_rsa.pub
  Setting SSH key for user id 2 done
You can also clear the key for a particular user. For example:
  ipmitool -I lanplus -H <IPADDR> -U root -P changeme sunoem sshkey del 2
  Deleted SSH key for user id 2
Using IPMItool to Read Sensors
For more information about supported IPMI 2.0 commands and the sen
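Finding the numeric ID for a user is the step that precedes both the password change and the SSH key commands. A minimal sketch of that lookup follows, assuming the whitespace-separated column layout shown in the user list output; the function names are hypothetical, and a user literally named "true" or "false" would be misread by this approach.

```python
def parse_user_list(text):
    """Parse `ipmitool ... user list` output into a list of dicts."""
    users = []
    for line in text.splitlines():
        fields = line.split()
        if not fields or not fields[0].isdigit():
            continue  # skip the header row and blank lines
        rest = fields[1:]
        name = ""
        # The anonymous account has an empty Name column, so the next
        # field is already one of the true/false flags.
        if rest and rest[0] not in ("true", "false"):
            name = rest.pop(0)
        if len(rest) < 4:
            continue  # not enough columns to be a user row
        users.append({
            "id": int(fields[0]),
            "name": name,
            "privilege": " ".join(rest[3:]),
        })
    return users


def user_id_for(text, name):
    """Return the ID of the named user, or None if it is not listed."""
    for user in parse_user_list(text):
        if user["name"] == name:
            return user["id"]
    return None


SAMPLE = """\
ID  Name  Callin  Link Auth  IPMI Msg  Channel Priv Limit
1         false   false      true      NO ACCESS
2   root  false   false      true      ADMINISTRATOR
"""
print(user_id_for(SAMPLE, "root"))
```

The returned ID is the value passed to `user set password` or `sunoem sshkey set` in the commands above.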
151. plays an error message logs the DMI Log Non fatal CMOS failed the error to DMI and boots Checksum Checksum check Bad Unsupported The BIOS supports The BIOS displays an error message logs the DMI Log Fatal CPU mismatched error and halts the system configuration frequency and steppings in CPU configuration but some CPUs might not be supported Correctable The CPU detectsa The CPU corrects the error in hardware No DMI Log Normal error variety of interrupt or machine check is generated by SP SEL operation correctable errors in the hardware The polling is triggered every the MCi_STATUS half second by SMI timer interrupts and is registers done by the BIOS SMI handler The SMI handler logs a message to the SP SEL if the SEL is available otherwise SMI logs a message to DMI The BIOS s polling can be disabled through software SMI Single fan Fan failure is The Front Fan Fault Service Action Required SP SEL Non fatal failure detected by reading and individual fan module LEDs are lit tach signals 198 Book Title without trademarks or an abbreviated book title July 2009 TABLE D 1 Hardware Error Handling Summary Continued Logged DMI Error Description Handling Log or SP SEL Fatal Multiple fan Fan failure is The Front Fan Fault Service Action Required SP SEL Fatal failure detected by reading and individual fan module LEDs are lit tach signals Single power When any of the Service Action Required and Power Supply SP SEL Non
152. ply sensor
- BIOS-generated events. These events relate to error messages generated in the BIOS.
- System management software events. These events relate to events that occur within the ILOM software.
After you have selected a category of event, the Event Log table is updated with the specified events. The fields in the Event Log are described in TABLE 11-1.

TABLE 11-1 Event Log Fields
Field         Description
Event ID      The number of the event, in sequence from number 1.
Time Stamp    The day and time the event occurred. If the Network Time Protocol (NTP) server is enabled to set the SP time, the SP clock will use Universal Coordinated Time (UTC). For more information about time stamps, see Interpreting Event Log Time Stamps on page 137.
Sensor Name   The name of a component for which an event was recorded. The sensor name abbreviations correspond to the following components: sys (System or chassis), p0 (Processor 0), p1 (Processor 1), io (I/O board), ps (Power supply), fp (Front panel), ft (Fan tray), mb (Motherboard).
Sensor Type   The type of sensor for the specified event.
Description   A description of the event.

4. To clear the event log, click the Clear Event Log button. A confirmation dialog box is displayed.
5. Click OK to clear all entries in the log.
6. If the problem with the server is not evident after viewing ILOM SP logs and information, continue with Running SunVTS Diagnostic Tests on page 120
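The sensor name abbreviations lend themselves to a simple lookup when scanning exported event logs. The sketch below is illustrative only: it assumes sensor names are dot-separated with the component abbreviation first (for example mb.t_amb), and the name ft0.example is a made-up placeholder rather than a real sensor.

```python
import re

# Abbreviations as listed for the Sensor Name field.
COMPONENTS = {
    "sys": "System or chassis",
    "p0": "Processor 0",
    "p1": "Processor 1",
    "io": "I/O board",
    "ps": "Power supply",
    "fp": "Front panel",
    "ft": "Fan tray",
    "mb": "Motherboard",
}


def component_for(sensor_name):
    """Return the component a sensor belongs to, or 'unknown'."""
    prefix = sensor_name.split(".", 1)[0]
    if prefix not in COMPONENTS:
        # Assumption: a trailing index (ps0, ft1) selects an instance
        # of the same component type.
        prefix = re.sub(r"\d+$", "", prefix)
    return COMPONENTS.get(prefix, "unknown")


for name in ("mb.t_amb", "p1.t_core", "ft0.example"):
    print(name, "->", component_for(name))
```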
153. pper critical 35 degrees C; Upper non-recoverable 40 degrees C
- CPU 0 (p0.t_core) and CPU 1 (p1.t_core) die temperatures: Upper non-critical 55 degrees C; Upper critical 65 degrees C; Upper non-recoverable 75 degrees C
There are three other temperature sensors:
- I/O board ambient temperature (io.t_amb)
- System controller ambient temperature (mb.t_amb)
- Power distribution board ambient temperature (pdb.t_amb)

To View Sensor Readings
1. Log in to the SP as Administrator or Operator to reach the ILOM web GUI:
a. Type the IP address of the server's SP into your web browser. The Sun Integrated Lights Out Manager Login screen is displayed.
b. Type your user name and password. When you first try to access the ILOM Service Processor, you are prompted to type the default user name and password. Default user name: root. Default password: changeme.
2. From the System Monitoring tab, select Sensor Readings. The Sensor Readings page is displayed. See FIGURE 11-3.
FIGURE 11-3 Sensor Readings Page
[Screenshot not reproduced: Sun Integrated Lights Out Manager window showing the System Monitoring tab with the Sensor Readings, Event Logs, and Locator Indicator subtabs.]
Sensor Readings View readings for temperature voltage o
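The CPU die temperature thresholds can be turned into a small classifier for readings pulled from the sensor page. This is a sketch under one stated assumption: a reading equal to a threshold is treated as having crossed it. The function name is hypothetical.

```python
# CPU die thresholds from the list above, highest first (degrees C).
CPU_DIE_LIMITS_C = [
    (75, "upper non-recoverable"),
    (65, "upper critical"),
    (55, "upper non-critical"),
]


def classify_cpu_die_temp(celsius):
    """Return the most severe threshold a reading has reached, or 'ok'."""
    for limit, label in CPU_DIE_LIMITS_C:
        # Assumption: reaching the limit counts as crossing it.
        if celsius >= limit:
            return label
    return "ok"


for reading in (44, 55, 70, 80):
    print(reading, classify_cpu_die_temp(reading))
```

The same pattern applies to the ambient sensors by substituting their own limits.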
154. ptor and a status reading sensor and the two are linked that is if you use the led sensor to turn on a particular LED then the status change is represented in the associated fail sensor Also for some of these an event is generated in the SEL For LEDs that blink on failure instead of steady on the events are not generated this is because it would display an event every time the LED flashed in the blink cycle TABLE 12 2 lists the LED sensor IDs in these servers See Identifying Status and Fault LEDs on page 171 for diagrams of the LED locations TABLE 12 2 LED Sensor IDs LED Sensor ID Description sys power led System Power front back sys locate led System Locate front back sys alert led System Alert front back sys psfail led System Power Supply Failed sys tempfail led System Over Temperature sys fanfail led System Fan Failed bp power led Back Panel Power bp locate led Back Panel Locate bp alert led Back Panel Alert fp power led Front Panel Power fp locate led Front Panel Locate fp alert led Front Panel Alert io hdd0 led Hard Disk 0 Failed io hddl led Hard Disk 1 Failed io hdd2 led Hard Disk 2 Failed 154 Sun Fire X4500 X4540 Servers Diagnostics Guide July 2009 TABLE 12 2 LED Sensor IDs Continued LED Sensor ID Description io hdd3 led Hard Disk 3 Failed io f0 led I O Fan Failed p0 led CPU 0 Failed p0 d0 led CPU 0 DIMM 0 Failed p0 d1 led CPU 0 DIMM 1 Failed p0 d2 led CPU 0 DIMM
155. r IPMItool 33 145 B back panel figure 64 174 BIOS changing POST options 53 164 event logs 47 159 POST code checkpoints 57 167 POST codes 55 166 POST overview 50 162 redirecting console output for POST 51 163 Bootable Diagnostics CD 16 120 button NMI 65 reset 65 Cc comments and suggestions xii component inventory viewing with ILOM SP GUI 24 137 viewing with IPMItool 41 152 condition change functions options and operands 74 configurations for DIMMs 11 124 console output redirecting 51 163 correctable errors handling 94 188 D default password changing with IPMItool 34 145 diagnostic software Bootable Diagnostics CD 16 120 SunVTS 15 119 120 DIMMs error handling 7 125 fault LEDs 8 128 isolating errors 11 130 population rules 11 123 supported configurations 11 124 E emergency shutdown 4 115 error handling correctable 94 188 DIMMs 7 125 hardware errors 102 196 mismatching processors 101 195 parity errors 96 190 system errors 99 192 uncorrectable errors 91 185 event logs BIOS 47 159 external inspection 4 112 external LEDs 61 F faults DIMM 8 128 finding sensor names 40 152 front panel LED locations 62 113 172 front panel LED locations 63 173 FRU inventory viewing with ILOM SP GUI 24 137 viewing with IPMItool 41 152 201 G gathering service visit information 2 111 general troubleshooting guidelines 2 111 gracef
156. r within the ILOM software
After you have selected a category of event, the Event Log table is updated with the specified events. The fields in the Event Log are described in TABLE 3-1.

TABLE 3-1 Event Log Fields
Field         Description
Event ID      The number of the event, in sequence from number 1.
Time Stamp    The day and time the event occurred. If the Network Time Protocol (NTP) server is enabled to set the SP time, the SP clock will use Universal Coordinated Time (UTC). For more information about time stamps, see Interpreting Event Log Time Stamps on page 23.
Sensor Name   The name of a component for which an event was recorded. The sensor name abbreviations correspond to the following components: sys (System or chassis), p0 (Processor 0), p1 (Processor 1), io (I/O board), ps (Power supply), fp (Front panel), ft (Fan tray), mb (Motherboard).
Sensor Type   The type of sensor for the specified event.
Description   A description of the event.

4. To clear the event log, click the Clear Event Log button. A confirmation dialog box is displayed.
5. Click OK to clear all entries in the log.
6. If the problem with the server is not evident after viewing ILOM SP logs and information, continue with Running SunVTS Diagnostic Tests on page 120.

Interpreting Event Log Time Stamps
The system event log time stamps are related to the servi
157. r fan sensors Select a sensor type category an Sensors yi Sensor Readings 77 sensors Status Name Reading State Asserted sys id 2 State Asserted sys intsw 0 Predictive Failure Deasserted sys psfail 1 Predictive Failure Deasserted sys tempfail 1 Predictive Failure Deasserted sys fanfail 1 Normal mb t_amb 24 degrees C Normal mb v_bat 3 232 Volts Normal mb v_ 3v3stby 3 217 Volts Unknown mb v_ 3v3 Not Available Unknown mb v_ 5v Not Available _Show Thresholds 3 Select the type of sensor readings that you want to view from the drop down menu You can select All Sensors Temperature Sensors Voltage Sensors or Fan Sensors 140 Sun Fire X4500 X4540 Servers Diagnostics Guide July 2009 The sensor readings are displayed The Sensor Readings fields are described in TABLE 11 2 TABLE 11 2 Sensor Readings Fields Field Description Status Name Reading Reports the status of the sensor including State Asserted State Deasserted Predictive Failure Device Inserted Device Present Device Removed Device Absent Unknown and Normal Reports the name of the sensor The names correspond to the following components e sys System or chassis e bp Back panel e fp Front panel e mb Motherboard e io I O board e p0 Processor 0 e p1 Processor 1 e ft0 Fan tray 0 e ftl Fan tray 1 e pdb Power distribution board e ps0 Power supply 0 e ps1 Power supply 1 Reports the
158. ral Help F10 Save and Exit ESC Exit v02 53 C Copyright 1985 2002 American Megatrends Inc Appendix B Error Handling 93 Handling of Correctable Errors This section lists facts and considerations about how the server handles correctable errors m During BIOS POST m The BIOS polls the MCK registers a The BIOS logs to DMI a The BIOS logs to the SP SEL through the BMC m The feature is turned off at OS boot time by default m Solaris support provides full self healing and automated diagnosis for the CPU and Memory subsystems m FIGURE B 2 shows an example of a DMI log screen from BIOS Setup Page FIGURE B 2 DMI Log Screen Correctable Error BIOS SETUP UTILITY View Event Log 09 12 05 12 33 16 ECC on Node 1 DIMM Pair 0 SPD address OAGh OAZh WA Y YA 6 j3 1b Single Bit ECC Memory Error m If during any stage of memory testing the BIOS finds itself incapable of reading or writing to the DIMM it takes the following actions a The BIOS disables the DIMM as indicated by the Memory Decreased message in the example in FIGURE B 3 a The BIOS logs an SEL record 94 Book Title without trademarks or an abbreviated book title July 2009 m The BIOS logs an event in DMI FIGURE B 3 DMI Log Screen Correctable Error Memory Decreased BIOS SETUP UTILITY 09 12 05 13 30 00 Memory decreased in 09 12 05 13 29 54 ECC on Node 1 DIMM Pair 0 SPD addres 09 12 05 13 29 54 Memory Error Single Bit ECC
159. rales pour les produits export s conform ment la l gislation am ricaine en mati re d exportation Sauf autorisation par les autorit s des Etats Unis l utilisation d unit s centrales pour proc der des mises jour de produits est rigoureusement interdite Si Ca Adobe PostScript Contents Preface xi Part I Sun Fire X4500 Server Diagnostics Guide 1 Initial Inspection of the Server 1 Service Visit Troubleshooting Flowchart 1 Gathering Service Visit Information 2 Troubleshooting Power Problems 3 Externally Inspecting the Server 4 Internally Inspecting the Server 4 Troubleshooting DIMM Problems 6 How DIMM Errors Are Handled By the System 7 Uncorrectable DIMM Errors 7 Correctable DIMM Errors 7 BIOS DIMM Error Messages 8 DIMM Fault LEDs 8 DIMM Population Rules 11 Supported DIMM Configurations 11 Isolating and Correcting DIMM ECC Errors 11 2 Using SunVTS Diagnostic Software 15 Running SunVTS Diagnostic Tests 15 SunVTS Documentation 16 Diagnosing Server Problems With the Bootable Diagnostics CD 16 Requirements 16 Using the Bootable Diagnostics CD 16 Using the ILOM Service Processor GUI to View System Information 19 Making a Serial Connection to the SP 20 Viewing ILOM SP Event Logs 21 Interpreting Event Log Time Stamps 23 Viewing Replaceable Component Information 24 Viewing Temperature Voltage and Fan Sensor Readings 26 v To View Sensor Readings 26 Using IPMItool to View System Information 31 About IPMI 32
160. re CPU for OS boot including final MTRR values A9 Wait for user input at config display if needed Chapter 5 Event Logs and POST Codes 59 TABLE 5 2 POST Code Checkpoints Continued Post Code Description AA Uninstall POST INT1Ch vector and INTO9h vector Deinitializes the ADM module AB Prepare BBS for Int 19 boot AC Any kind of Chipsets NB SB specific programming needed during End POST just before giving control to runtime code booting to OS Programmed the system BIOS OFO0000h shadow RAM cacheability Ported to handle any OEM specific programming needed during End POST Copy OEM specific data from POST_DSEG to RUN_CSEG BI Save system context for ACPI 00 Prepares CPU for booting to OS by copying all of the context of the BSP to all application processors present NOTE APs are left in the CLIHLT state 61 70 OEM POST Error This range is reserved for chipset vendors and system manufacturers The error associated with this value may be different from one platform to the next 60 Sun Fire X4500 X4540 Servers Diagnostics Guide July 2009 CHAPTER 6 Status Indicator LEDs This appendix contains information about the locations and behaviors of the status and fault LEDs on the server The information is organized to describe external LEDs that can be viewed on the outside of the server and internal LEDs that can be viewed only with the component covers including hard disk drive cover system controller cover a
161. reen select Advanced The Advanced Settings screen is displayed Main Advanced PCIPnP Boot Security Chipset Exit kkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkxkk Advanced Settings Options for CPU iy k kxkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkxk OK WARNING Setting wrong values in below sections y ki may cause system to malfunction F 7 CPU Configuration 7 ki IDE Configuration x 8 SuperIO Configuration ACPI Configuration Event Log Configuration x Hyper Transport Configuration IPMI 2 0 Configuration X MPS Configuration AK Select Screen PCI express Configuration AMD PowerNow Configuration Ho eR Select Item Remote Access Configuration Enter Go to Sub Screen USB Configuration FI General Help F10 Save and Exit ai ki ESC Exit k kxkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkxkk 48 Sun Fire X4500 X4540 Servers Diagnostics Guide July 2009 b From the Advanced Settings screen select Event Log Configuration The Advanced Menu Event Logging Details screen is displayed Advanced KKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKK Event Logging details View all unread events ko KKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKK OK on the Event Log kai View Event Log li a Mark all events as read ki
162. rom /dev/mcelog to /var/log/messages.
- If HERD is not installed, a program called mcelog copies messages from /dev/mcelog to /var/log/mcelog.
The Bootable Diagnostics CD described in Using SunVTS Diagnostic Software also captures and logs CEs.

BIOS DIMM Error Messages
The BIOS displays and logs the following DIMM error messages:
NODE n Memory Configuration Mismatch
The following conditions will cause this error message:
- The DIMMs mode is not paired (running in 64-bit mode instead of 128-bit mode).
- The DIMMs speed is not the same.
- The DIMMs do not support ECC.
- The DIMMs are not registered.
- The MCT stopped due to errors in the DIMM.
- The DIMM module type (buffer) is mismatched.
- The DIMM generation (I or II) is mismatched.
- The DIMM CL/T is mismatched.
- The banks on a two-sided DIMM are mismatched.
- The DIMM organization is mismatched (128-bit).
- The SPD is missing Trc or Trfc information.

DIMM Fault LEDs
When you press the Press to See Fault button on the motherboard or the mezzanine board, LEDs next to the DIMMs flash to indicate that the system has detected 24 or more CEs in a 24-hour period on that DIMM.
Note: The DIMM Fault and Motherboard Fault LEDs operate on stored power for up to a minute when the system is powered down, even after the AC power is disconnected and the motherboard or mezzanine board is out of the system. The stored pow
163. rs are not the same or Checksum is mismatched
- NODE n DIMMs Manufacturer Mismatch. The following conditions display this error message:
  - DIMMs manufacturer is not supported. Only Samsung, Micron, Infineon, and SMART DIMMs are supported.

DIMM Fault LEDs
In the Sun Fire X4500 server, there are eight DIMM slots on the CPU board. The server has internal status LEDs for the CPU board. DIMM and CPU fault LEDs on the CPU board provide further indications of which component has a fault condition. These CPU and DIMM fault LEDs can be lit for up to one minute by a capacitor on the CPU board, even after the CPU board is removed from the server. To light the fault LEDs from the capacitor, push the small button on the CPU board labeled Press to see fault. See FIGURE 1-3 for the LED and button locations.
The DIMM ejector levers contain LEDs that can indicate a faulty DIMM:
- DIMM fault LED is off: The DIMM is operating properly.
- DIMM fault LED is on (amber): The DIMM is faulty and should be replaced.
The CPU fault LED can indicate a faulty CPU on CPU 0 or CPU 1:
- CPU fault LED is off: The CPU is operating properly.
- CPU fault LED is on (amber): The CPU is faulty and should be replaced.
- Battery fault LED is on (amber): The battery is faulty and should be replaced.
Note: The CPU fault and DIMM LEDs continue to indicate a failure until the system is powered up. The Bat
164. rver Front konn hd Command Options and Parameters The hd utility makes a distinction between controllers slots and storage devices that are physically present in the machine and visible to the Solaris OS Chapter 7 hd Utility 73 The hd command provides configuration and status information about the Sun Fire X4500 server s hard drives by using specific command options and parameters These options and parameters can be combined to display the information of your choosing Some of the options available include displaying color mode c summary s diagnose d identifying platform type p and obtaining configuration and status help messages h FIGURE 7 3 shows a complete listing of hd Utility commands hd Man page FIGURE7 3 Sample hd Utility Man page c olor mode s ummary p latform b ypass to print SunFireX4500 map d iagnose f syslog file w pci drive path m adjacent cross front2back diagonal Mapping pairs h elp a fdisk partition type l q list SunFireX4500 with index in seQuential list l g list drive slot number in seQuential list with temperature 1 List SunFireX4500 available disk in physical orders r List SMART data for all disks in drive slot number R List SMART data s indivdual id in landscape view for all disks e cXtY List SMART data for specified disk j List SunFireX450
165. ry from beginning of list SP can control or change boot order Appendix B Error Handling 105 106 Book Title without trademarks or an abbreviated book title July 2009 PART II Sun Fire X4540 Server Diagnostics Guide This part contains the Sun Fire X4540 Server Diagnostics Guide and has the following chapters Initial Inspection of the Server on page 8 109 Using SunVTS Diagnostic Software on page 9 119 Troubleshooting DIMM Problems on page 10 123 Using the ILOM Service Processor GUI to View System Information on page 11 133 Using IPMItool to View System Information on page 12 143 Event Logs and POST Codes on page 13 159 Identifying Status and Fault LEDs on page 14 171 Sun Fire X4540 Sensor Locations on page C 181 Error Handling on page D 185 CHAPTER 8 Initial Inspection of the Server This chapter includes the following topics m Service Visit Troubleshooting Flowchart on page 109 m Gathering Service Visit Information on page 111 m Troubleshooting Power Problems on page 111 m External Inspection of the Server on page 112 m Internal Inspection of the Server on page 115 Service Visit Troubleshooting Flowchart Use the following flowchart as a guideline for using this guide to troubleshoot the Sun Fire Sun Fire X4500 X4540 Servers server 109 FIGURE 8 1 Troubleshooting Flowchart To perform this task Gat
166. s and other software documentation see the following URL http docs sun com Sun Welcomes Your Comments Sun is interested in improving its documentation and welcomes your comments and suggestions You can submit your comments by going to http www sun com hwdocs feedback Please include the title and part number of your document with your feedback Sun Fire X4500 X4540 Servers Diagnostics Guide part number 819 4363 12 xii Sun Fire X4500 X4540 Servers Diagnostics Guide July 2009 PART I Sun Fire X4500 Server Diagnostics Guide This part contains the Sun Fire X4500 Server Diagnostics Guide and has the following chapters Initial Inspection of the Server on page 1 1 Using SunVTS Diagnostic Software on page 2 15 Using the ILOM Service Processor GUI to View System Information on page 3 19 Using IPMItool to View System Information on page 4 31 Event Logs and POST Codes on page 5 47 Status Indicator LEDs on page 6 61 hd Utility on page 7 71 Sun Fire X4500 Sensor Locations on page A 87 Error Handling on page B 91 CHAPTER 1 Initial Inspection of the Server This chapter includes the following topics Service Visit Troubleshooting Flowchart on page 1 Gathering Service Visit Information on page 2 Troubleshooting Power Problems on page 3 Externally Inspecting the Server on page 4 Internally Inspecting the Server on pag
167. s full Full sensor records Temperature voltage and fan sensors compact Compact sensor records Digital Discrete failure and presence sensors event Event only records Sensors used only for matching with SEL records mcloc MC locator records Management Controller sensors generic Generic locator records Generic devices LEDs fru FRU locator records FRU devices For example to see only the temperature voltage and fan sensors type the following command with the full argument ipmitool I lanplus H IPADDR U root P changeme sdr elist full fp t amb OAh ok 12 0 22 degrees C ps t_amb 11h ok 10 0 21 degrees C ps0 0 speed 15h ok 10 0 11000 RPM psl 0 speed 19h ok 10 1 0 RPM mb t_amb 1Ah ok 7 0 25 degrees C mb v_bat 1Bh ok 7 0 3 18 Volts mb v_ 3v3stby ich ok 7 0 3 17 Volts mb v_ 3v3 1Dh ok 7 0 3 34 Volts mb v_ 5v 1Eh ok 7 0 5 04 Volts mb v 12v 1Fh ok 7 0 12 22 Volts mb v_ 12v 20h ok 7 0 12 20 Volts mb v_ 2v5core 21h ok 7 0 2 54 Volts mb v lv8core 22h ok 7 0 1 83 Volts mb v lv2core 23h ok 7 0 1 21 Volts io t_amb 24h ok 15 0 21 degrees C pO t core 2Bh ok 3 0 44 degrees C pO v 1v5 2Ch ok 3 0 1 56 Volts p0 v_ 2v5core 2Dh ok 3 0 2 64 Volts p0 v_ 1v25core 2Eh ok 3 0 1 32 Volts pl t_core 34h ok 3 1 40 degrees C pl v 1v5 35h ok 3 1 1 55 Volts pl v_ 2v5core 36h ok 3 1 2 64 Volts 36 Sun Fire X4500 X4540 Servers Diagnostics
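Output from sdr elist is easier to filter once each row is split into fields. The sketch below assumes the pipe-delimited layout that ipmitool prints for sdr elist (name, sensor number in hex, status, entity ID, reading); the separators and decimal points are lost in the flattened listing above, so the sample rows are a reconstruction of three of them and the function name is hypothetical.

```python
def parse_sdr_elist(text):
    """Parse pipe-delimited `sdr elist` rows into dictionaries."""
    rows = []
    for line in text.splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) != 5:
            continue  # skip anything that is not a five-column row
        name, number, status, entity, reading = parts
        value, _, unit = reading.partition(" ")
        try:
            value = float(value)
        except ValueError:
            # Non-numeric readings are kept as text in the unit field.
            value, unit = None, reading
        rows.append({
            "name": name,
            "number": int(number.rstrip("h"), 16),  # "0Ah" -> 10
            "status": status,
            "entity": entity,
            "value": value,
            "unit": unit,
        })
    return rows


SAMPLE = """\
fp.t_amb  | 0Ah | ok | 12.0 | 22 degrees C
mb.t_amb  | 1Ah | ok |  7.0 | 25 degrees C
p0.t_core | 2Bh | ok |  3.0 | 44 degrees C
"""
for row in parse_sdr_elist(SAMPLE):
    print(row["name"], row["value"], row["unit"])
```

The sensor number is converted from its hexadecimal form so it can be matched against the sensor numbers reported in SEL events.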
168. s 4 bit error det corr of x4 type memory Enable PCI X clock lines in the 8131 168 Sun Fire X4500 X4540 Servers Diagnostics Guide July 2009 TABLE 13 2 POST Code Checkpoints Continued Post Code Description 20 Relocate all the CPUs to a unique SMBASE address The BSP will be set to have its entry point at A000 0 If less than 5 CPU sockets are present on a board subsequent CPUs entry points will be separated by 8000h bytes If more than 4 CPU sockets are present entry points are separated by 200h bytes CPU module will be responsible for the relocation of the CPU to correct address NOTE APs are left in the INIT state 24 Uncompress and initialize any platform specific BIOS modules 30 Initializes System Management Interrupt 2A Initializes different devices through DIM 2C Initializes different devices Detects and initializes the video adapter installed in the system that have optional ROMs 2E Initializes all the output devices 31 Allocate memory for ADM module and uncompress it Give control to ADM module for initialization Initializes language and font modules for ADM Activate ADM module 33 Initializes the silent boot module Sets the window for displaying text information 37 Displaying sign on message CPU information setup key message and any OEM specific information 38 Initializes different devices through DIM 39 Initializes DMAC 1 and DMAC 2 3A Initialize RTC date time 3B Test for total memory install
169. sage in the device status row. It appears in yellow if the -c option is used. It prints the disk warning message, which includes a timestamp indicating when the event happened.
-f  Allows you to specify any previous syslog file (usually /var/adm/messages.n) with any disk warning messages.
-m  Maps the various possible pairs of drives for the Sun Fire X4500 server system. This command option is useful for testing drive-to-drive interaction from one drive to another drive in separate locations in the Sun Fire X4500 server. For performance and other file system software, there are various ways to construct the pool of drives. This option provides distinct pairings based on the current probed logical and physical maps in the system. Supported map types are as follows:
    - Adjacent: Drive pairs that are on adjacent Marvell host controllers
    - Cross: Drive pairs that are on alternate Marvell host controllers
    - Front2back: Drive pairs that are on the front and back rows
    - Diagonal: Drive pairs that are on diagonal locations
-w  Translates a Solaris OS raw storage PCI device path to the cXtY device name used by most applications.
-h  Provides help.
-a  Lists the fdisk(1M) partition type. This option scans the disks for fdisk partitions that are recognized by x64 Solaris OS. Because the x64 platform also runs Linux and Windows, some of the disks could have non-Solaris fdisk partitions, for example, on systems with dual-booted operating systems.
-q
170. solating and Correcting DIMM ECC Errors

Using the ILOM Service Processor GUI to View System Information
    Connecting the SP to a Serial Port
    Viewing ILOM SP Event Logs
    Interpreting Event Log Time Stamps
    Viewing Replaceable Component Information
    Viewing Temperature, Voltage, and Fan Sensor Readings
    To View Sensor Readings

12. Using IPMItool to View System Information
    About IPMI
    About IPMItool
    IPMItool Man Page
    Connecting to the Server With IPMItool
    Enabling the Anonymous User
    Changing the Default Password
    Configuring an SSH Key
    Using IPMItool to Read Sensors
    Reading Sensor Status
    Reading All Sensors
    Reading Specific Sensors
    Using IPMItool to View the ILOM SP System Event Log
    Viewing the SEL With IPMItool
    Clearing the SEL With IPMItool
    Using the Sensor Data Repository (SDR) Cache
    Sensor Numbers and Sensor Names in SEL Events
    Viewing Component Information With IPMItool
    Viewing and Setting Status LEDs
    LED Sensor IDs
    LED Modes
    LED Sensor Groups
    Using IPMItool Scripts for Testing

13. Event Logs and POST Codes
    Viewing Event Logs
    About Power-On Self-Test (POST)
    BIOS POST Memory Test Overview
    Redirecting Console Output
    Changing POST Options
    To Change POST Options
    POST Codes
    POST Code Checkpoints
171. sor information

Post Code  Description
C2  Set up boot strap processor for POST. This includes frequency calculation, loading BSP microcode, and applying user requested value for GART Error Reporting setup question.
C3  Errata workarounds applied to the BSP (78 & 110).
C5  Enumerate and set up application processors. This includes microcode loading and workarounds for errata (78, 110, 106, 107, 69, 63).
C6  Re-enable cache for boot strap processor, and apply workarounds in the BSP for errata 106, 107, 69, and 63 if appropriate. In case of mixed CPU steppings, errors are sought and logged, and an appropriate frequency for all CPUs is found and applied. NOTE: APs are left in the CLI HLT state.
C7  The HT sets link frequencies and widths to their final values. This routine gets called after CPU frequency has been calculated to prevent bad programming.
0A  Initializes the 8042 compatible Keyboard Controller.
0B  Detects the presence of PS/2 mouse.
0C  Detects the presence of Keyboard in KBC port.
0E  Testing and initialization of different Input Devices. Also, update the Kernel Variables. Traps the INT09h vector, so that the POST INT09h handler gets control for IRQ1. Uncompress all available language, BIOS logo, and Silent logo modules.
13  Initializes PM regs and PM PCI regs at Early POST. Initializes multi host bridge, if system supports it. Setup ECC options before memory clearing. REDIRECTION causes corrected data to be written to RAM immediately. CHIPKILL provide
172. sor naming for this server, also refer to the Integrated Lights Out Manager Administration Guide (819-1160).

Reading Sensor Status

There are a number of ways to read sensor status, from a broad overview that lists all sensors to querying individual sensors and returning detailed information on them. For information on the physical locations of the sensors in the system, see Sun Fire X4540 Sensor Locations on page 181.

Reading All Sensors

To get a list of all sensors in these servers and their status, use the sdr list command with no arguments. This returns a large table with every sensor in the system and its status. The five fields of the output lines, as read from left to right, are:

1. IPMI sensor ID (16-character maximum)
2. IPMI sensor number
3. Sensor status, indicating which thresholds have been exceeded
4. Entity ID and instance
5. Sensor reading

For example:

    fp.t_amb | 0Ah | ok | 12.0 | 22 degrees C

Reading Specific Sensors

Although the default output is a long list of sensors, it is possible to refine the output to see only specific sensors. The sdr list command can use an optional argument to limit the output to sensors of a specific type. TABLE 12-1 describes the available sensor arguments.

TABLE 12-1 IPMItool Sensor Arguments

Argument  Description          Sensors
all       All sensor records   All sensors
full      Full sensor records  Temperature, volta
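The five-field layout described under Reading All Sensors lends itself to a small parser when sensor listings are post-processed in a script. The following is a minimal sketch, assuming the fields are separated by pipe characters as in typical IPMItool output; the names `SensorLine` and `parse_sensor_line` are illustrative and not part of IPMItool.

```python
from typing import NamedTuple


class SensorLine(NamedTuple):
    sensor_id: str      # field 1: IPMI sensor ID
    sensor_number: str  # field 2: IPMI sensor number, e.g. "0Ah"
    status: str         # field 3: sensor status
    entity: str         # field 4: entity ID and instance
    reading: str        # field 5: sensor reading


def parse_sensor_line(line: str) -> SensorLine:
    """Split one listing line into its five fields (assumes '|' separators)."""
    fields = [field.strip() for field in line.split("|")]
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}: {line!r}")
    return SensorLine(*fields)


sample = "fp.t_amb | 0Ah | ok | 12.0 | 22 degrees C"
print(parse_sensor_line(sample))
```

Keeping every field as a string avoids guessing at units; a caller that needs a number can convert the reading field itself.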
173. t

5. Determine whether you successfully connected to the SP:
   * If you successfully connected to the SP, continue with the following procedures:
       * Viewing ILOM SP Event Logs on page 134
       * Viewing Replaceable Component Information on page 137
       * Viewing Temperature, Voltage, and Fan Sensor Readings on page 139
   * If you could not connect to the SP, there might be a problem with the graphics redirect and service processor (GRASP) board. Replace this board, and then repeat Step 1 through Step 4. Refer to the Sun Fire X4540 Server Service Manual (819-4359) for instructions.

Viewing ILOM SP Event Logs

Events are notifications that occur in response to some actions. The IPMI system event log (SEL) provides status information about the Sun Fire X4540 server's hardware and software to the ILOM software, which displays the events in the ILOM web GUI.

* If any of the logs or information screens indicate a DIMM error, see BIOS DIMM Error Messages on page 127 and Isolating and Correcting DIMM ECC Errors on page 130.
* If the problem with the server is not evident after viewing ILOM SP logs and information, continue with Running SunVTS Diagnostic Tests on page 120.

To view event logs:

1. Log in to the SP as Administrator or Operator to reach the ILOM web GUI:
   a. Type the IP address of the server's SP into your web browser. The Sun Integrated Lights
174. t Device Priority screen appears.

6. Select the DVD-ROM drive to be the primary boot device.
7. Save and exit the BIOS screens.
8. Reboot the server. When the server reboots from the CD in the DVD-ROM drive, the Solaris Operating System boots and SunVTS software starts and opens its first GUI window.
9. In the SunVTS GUI, press Enter or click the Start button when you are prompted to start the tests. The test suite will run until it encounters an error or the test is completed.
   Note: The CD will take approximately nine minutes to boot.
10. When SunVTS software completes the test, review the log files generated during the test.

Reviewing SunVTS Log Files

1. Click the Log button. The Log file window is displayed.
2. Specify the log file that you want to view by selecting it from the Log File window. The content of the selected log file is displayed in the window.
3. Choose the following actions from the three lower buttons:
   * Print the log file: A dialog box appears for you to specify your printer options and printer name.
   * Delete the log file: The file remains displayed, but will be gone the next time you try to display it.
   * Close the Log file window: The window is closed.

Note: To save the log files, you must save the log files to another networked system or a removable media device. When you use the Bootable Diagnostics CD, the server boots from th
175. t Manager Administration Guide (819-1160).

* If any of the logs or information screens indicate a DIMM error, see Troubleshooting DIMM Problems on page 6 and How DIMM Errors Are Handled by the System on page 125.
* If the problem with the server is not evident after viewing ILOM SP logs and information, continue with Running SunVTS Diagnostic Tests on page 120.

Making a Serial Connection to the SP

To make a serial connection to the SP:

1. Connect a serial cable from the RJ-45 Serial Management port on your ILOM SP to a terminal device.
2. Press ENTER on the terminal device to establish a connection between that terminal device and the ILOM SP.
   Note: If you are connecting to the serial port on the SP before it has been powered up or during its power-up sequence, you will see bootup messages displayed.
   The service processor eventually displays a login prompt. For example:

       SUNSP0003BA84D777 login:

   The first string in the prompt is the default host name for the ILOM SP. It consists of the prefix SUNSP and the MAC address of the ILOM SP. The MAC address for each ILOM SP is unique.
3. Log in to the SP and type the default user name, root, with the default password, changeme. Once you have successfully logged in to the SP, it displays its default command prompt:

       ->

4. To start the serial console, type the following commands:

       cd /SP/console
       start

5. Determine whether you successfully connected to t
176. t system to clear fault.

LED              Color  Function
CPU Failure      Amber  Blinks to indicate that the system has found a fault with a CPU. Restart system to clear fault.
Battery Failure  Amber  Blinks to indicate that the system has found a fault with the battery. Start service processor to clear fault.

APPENDIX C Sun Fire X4540 Sensor Locations

This appendix lists the locations of the sensors of the Sun Fire X4540 server.

TABLE C-1

Name of Sensor      Location of Sensor
sys intsw           Power backplane
sys acpi            IO controller board
sys nmi             Not a sensor, but an NMI button on rear backplane
sys reset btn       Not a sensor, but an NMI button on rear backplane
sys locate btn      Not a sensor, but an NMI button on front backplane
sys v_ 3v3stby      CPU board
sys v_ 3v3          CPU board
bat v_bat           CPU board
sys v_ 12v          CPU board
sys v_ 1v2ht        CPU board
proc prsnt          CPU board
proc front t_amb    CPU board
proc rear t_amb     CPU board
p0 prsnt            CPU board
p0 hot              CPU board
p0 t_core           CPU board
p0 v_vddcore        CPU board
p0 v_ 1v8           CPU board
p0 v_ 0v9           CPU board

TABLE C-1 (Continued)

Name of Sensor      Location of Sensor
p1 prsnt
p1 hot
p1 v_vddcore
p1 t_core
p1 v_ 1v8
p1 v_ 0v9
io rear t_amb
io front t_amb
io v_bat
io v_ 3v3stby
io v 3v3
io v 5v
io v 12v
io v 5v disk
io v 1v5
io v 1v4
io v 1v8
io v 1v2
dbp t amb
ft0 f0 speed
ft0 f
177. ternal status indicator LEDs, which can indicate component malfunction. For LED locations and descriptions, see Internal Status Indicator LEDs on page 175 and DIMM Fault LEDs on page 128.

Note: You can hold down the Locate button on the server back panel or front panel for 5 seconds to initiate a push-to-test mode that illuminates all other LEDs, both inside and outside of the chassis, for 15 seconds.

Verify that there are no loose or improperly seated components.

Verify that all cable connectors inside the system are firmly and correctly attached to their appropriate connectors.

Verify that any after-factory components are qualified and supported. For a list of supported PCI cards and DIMMs, refer to the Sun Fire X4540 Server Service Manual (819-4359).

Check that the installed DIMMs comply with the supported DIMM population rules and configurations, as described in Chapter 10, Troubleshooting DIMM Problems on page 123.

Replace the component covers.

9. To restore main power mode to the server (all components powered on), use a non-conducting ballpoint pen or stylus to press and release the Power button on the server front panel. See FIGURE 8-4. When main power is applied to the full server, the Power OK LED next to the Power button lights and remains lit.

10. If the problem with the server is not evident, you can try viewing the power on se
178. tery LED continues to indicate a failure until the service processor is started. When a UE is detected by the BIOS, the DIMM LEDs will also illuminate. For more information on CPU fault indicators and replacing CPUs, refer to the Sun Fire X4500 Server Service Manual (819-4359).

FIGURE 1-3 CPU Module LED and Button Locations

Figure Legend
1   DIMM 0 2 1 3
2   CPU 1 (under heatsink)
3   CPU 0 (under heatsink)
4   DIMM 3 1 2 0
5   DIMM fault LEDs
6   CPU 1 fault LED
7   Battery
8   Battery fault LED
9   CPU 0 fault LED
10  Press to see fault
11  DIMM fault LED

DIMM Population Rules

The DIMM population rules for the Sun Fire X4500 server are as follows:

* Each CPU can support a maximum of four DIMMs.
* The DIMM slots are paired, and the DIMMs must be installed in pairs (0 and 1, 2 and 3). See FIGURE 1-3.
* CPUs with only a single pair of DIMMs must have those DIMMs installed in that CPU's white DIMM slots (0 and 1). See FIGURE 1-3.
* Only PC3200 ECC Registered DIMMs are supported.
* Each pair of DIMMs must be
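The slot-related population rules above can be expressed as a quick check, which may help when reviewing a planned configuration. This is a minimal sketch covering only the rules that are fully stated here (four slots per CPU, pairs 0 and 1 or 2 and 3, a single pair in slots 0 and 1); it does not check DIMM type or the matching of DIMMs within a pair. The function name `check_dimm_population` is hypothetical.

```python
def check_dimm_population(populated_slots):
    """Return a list of rule violations for one CPU's DIMM slots (0-3)."""
    slots = set(populated_slots)
    problems = []
    # Each CPU supports a maximum of four DIMMs, in slots 0 through 3.
    if not slots <= {0, 1, 2, 3}:
        problems.append("each CPU has only four DIMM slots (0-3)")
    # DIMMs must be installed in pairs: slots 0 and 1, slots 2 and 3.
    for pair in ({0, 1}, {2, 3}):
        if len(slots & pair) == 1:
            problems.append(f"slots {sorted(pair)} must be populated together")
    # A CPU with only a single pair must use slots 0 and 1.
    if slots == {2, 3}:
        problems.append("a single pair must go in slots 0 and 1")
    return problems


print(check_dimm_population([0, 1]))  # valid: no violations reported
print(check_dimm_population([2, 3]))  # single pair in the wrong slots
```

Returning a list of messages rather than a single boolean lets a caller report every violated rule at once.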
179. the time it takes for the system to boot.

* System Configuration Display: This option is disabled by default. If you enable this, the System Configuration screen is displayed before booting begins.
* Quiet Boot: This option is disabled by default. If you enable this, the Sun Microsystems logo is displayed instead of POST codes.

POST Codes

TABLE 13-1 contains descriptions of each of the POST codes, listed in the same order in which they are generated. These POST codes appear as a four-digit string that is a combination of two-digit output from primary I/O port 80 and two-digit output from secondary I/O port 81. In the POST codes listed in TABLE 13-1, the first two digits are from port 81 and the last two digits are from port 80.

TABLE 13-1 POST Codes

Post Code  Description
00d0  Coming out of POR, PCI configuration space initialization, Enabling 8111's SMBus
00d1  Keyboard controller BAT, Waking up from PM, Saving power-on CPUID in scratch CMOS
00d2  Disable cache, full memory sizing, and verify that flat mode is enabled
00d3  Memory detections and sizing in boot block, cache disabled, IO APIC enabled
01d4  Test base 512KB memory. Adjust policies and cache first 8MB
01d5  Bootblock code is copied from ROM to lower RAM. BIOS is now executing out of RAM
01d6  Key sequence and OEM specific method is checked to determine if BIOS recovery is forced. If next code is E0, BIOS recovery is
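Because each four-digit POST code combines the output of two I/O ports, it can be useful to split a logged code back into its two bytes. The following is a minimal sketch of that split, following the digit order described for TABLE 13-1 (first two digits from port 81, last two from port 80); the function name `split_post_code` is illustrative only.

```python
def split_post_code(code: str):
    """Split a four-digit POST code into (port 81 byte, port 80 byte)."""
    if len(code) != 4:
        raise ValueError("expected a four-digit hex string such as '01d4'")
    port81 = int(code[:2], 16)  # first two digits: secondary I/O port 81
    port80 = int(code[2:], 16)  # last two digits: primary I/O port 80
    return port81, port80


port81, port80 = split_post_code("01d4")
print(f"port 81: {port81:02x}, port 80: {port80:02x}")
```

The port 80 byte is the part that corresponds to the two-digit values that the guide describes as output from primary I/O port 80.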
180. tion

0037  Displaying sign-on message, CPU information, setup key message, and any OEM-specific information

TABLE 5-1 POST Codes (Continued)

Post Code  Description
4538  PCI devices IPL device initialization
5538  PCI devices General device initialization
8600  Preparing CPU for booting to OS by copying all of the context of the BSP to all application processors present. NOTE: APs are left in the CLI HLT state.

POST Code Checkpoints

The POST code checkpoints are the largest set of checkpoints during the BIOS pre-boot process. TABLE 5-2 describes the type of checkpoints that might occur during the POST portion of the BIOS. These two-digit checkpoints are the output from primary I/O port 80.

TABLE 5-2 POST Code Checkpoints

Post Code  Description
03  Disable NMI, Parity, video for EGA, and DMA controllers. At this point, only ROM accesses are to the GPNV. If boot block (BB) size is more than 64K, you must turn on ROM Decode below address FFFF0000h. It should allow USB to run in E000 segment. The HT must program the NB-specific initialization, and OEM-specific initialization can be programmed, if needed, at the beginning of BIOS POST, such as overriding the default values of Kernel Variables.
04  Check CMOS diagnostic byte to determine if battery power is OK and CMOS checksum is OK. Verify CMOS checksum manually by reading storage area. If the CMOS checksum is bad, update CMOS with pow
181.

Using the hd Utility

To use the hd utility, you must have the hd package installed. This package is preinstalled in /opt/SUNWhd/hd/bin/hd. For additional commands related to hd, see the following man pages: format(1M), cfgadm(1M), devfsadm(1M), and fdisk(1M).

hd Utility Mapping

You can use the drive mapping output from the hd utility for remote analysis. The utility also probes and displays all of the available storage devices in the system with their logical device name, serial number, vendor, model, and drive temperatures.

Here is sample output from the hd utility.

FIGURE 7-2 Sample hd Utility Hard Disk Drive Map: physical slots 36 to 47 (rear row) down to 0 to 11 (front row) of the Sun Fire X4500 server, each labeled with its cXtY device name.
182. to an individual DIMM. Instead, it might be caused by CPU0 or by the DIMM slot. Continue with the next step.

Shut down the server again and disconnect the AC power cords.

Remove both DIMMs of the pair and install them into paired slots on the second CPU board that did not indicate a DIMM problem. Using the slot numbers in the example, install the two DIMMs from CPU0 slots 1 and 3 into CPU1 slots 1 and 3, or CPU1 slots 0 and 2.

Reconnect AC power cords to the server.

Power on the server and run the diagnostics test again.

Review the log file.

* If the error now appears under the CPU that manages the DIMM slots you just installed, the problem is with the DIMMs. Return both DIMMs (the pair) to the Support Center for replacement.
* If the error remains with the original CPU, there is a problem with that CPU.

CHAPTER 2 Using SunVTS Diagnostic Software

This chapter contains information about the Sun diagnostic software tools. This chapter includes the following topics:

* Running SunVTS Diagnostic Tests on page 15
* Diagnosing Server Problems With the Bootable Diagnostics CD on page 16

Running SunVTS Diagnostic Tests

The Sun Fire X4500 servers are shipped with a Bootable Diagnostics CD that contains SunVTS software. SunVTS is the Sun Validation Test Suite, which provides a comprehensive diagn
183. ul shutdown 4, 115
guidelines for troubleshooting 2, 111

H
hardware errors handling 102, 196
hd utility 71

ILOM SP GUI
    general information 19
    serial connection 20, 133
    time stamps 23, 137
    viewing component inventory 24, 137
    viewing sensors 26, 139
    viewing SP event log 21, 134
inspection
    external 4, 112
    internal 4, 115
Integrated Lights Out Manager Service Processor. See ILOM SP GUI
Intelligent Platform Management Interface. See IPMI
internal inspection 4, 115
internal LEDs 66, 175
IPMI
    general information 32, 143
IPMItool
    changing default password 34, 145
    clearing SP SEL 40, 151
    configuring SSH key 34, 145
    connecting to server 33, 144
    enabling anonymous user 33, 145
    general information 32, 144
    LED modes 44, 155
    LED sensor groups 44, 156
    LED sensor IDs 42, 154
    location of package 32, 144
    man page 32, 144
    setting LED status 42, 153
    using scripts for testing 45, 156
    using SDR cache 40, 151
    viewing component inventory 41, 152
    viewing LED status 42, 153
    viewing sensor status 35, 146
    viewing SP SEL 38, 149
isolating DIMM ECC errors 11, 130

L
LEDs
    external 61
    front panel locations 62, 63, 113, 172, 173
    internal 66, 175
    modes 44, 155
    sensor groups 44, 156
    sensor IDs 42, 154
    setting status with IPMItool 42, 153
    viewing status with IPMItool 42, 153
logical to physical device mapping 71

M
mapping sensor numbers to sensor names 4
184. ulty DRAM and execute the following items:

    ipmitool> sel list
    100 | 08/26/2005 | 11:36:09 | OEM 0xfb
    200 | 08/26/2005 | 11:36:12 | System Firmware Error | No usable system memory
    300 | 08/26/2005 | 11:36:12 | Memory | Memory Device Disabled | CPU 0 DIMM 0

* When the faulty DIMM is beyond the BIOS's low 1MB extraction space, proper boot happens:

    ipmitool> sel list
    100 | 08/26/2005 | 05:04:04 | OEM 0xfb
    200 | 08/26/2005 | 05:04:09 | Memory | Memory Device Disabled | CPU 0 DIMM 0

* Note the following considerations for this revision:
    * Uncorrectable ECC Memory Error is not reported.
    * Multi-bit ECC errors are reported as Memory Device Disabled.
    * On first reboot, BIOS logs a HyperTransport Error in the DMI log.
    * The BIOS disables the DIMM.
    * The BIOS sends the SEL records to the BMC.
    * The BIOS reboots again.
    * The BIOS skips the faulty DIMM on the next POST memory test.
    * The BIOS reports available memory, excluding the faulty DIMM pair.

FIGURE B-1 shows an example of a DMI log screen from BIOS Setup Page.

FIGURE B-1 DMI Log Screen, Uncorrectable Error (BIOS SETUP UTILITY, Advanced, Event Logging details). View Event Log entry: 09/12/05 11:51:05 A Hyper Transport sync flood error occurred on last boot
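When many SEL listings must be reviewed, the "Memory Device Disabled" records described above can be picked out with a short script. The following is a minimal sketch, assuming the listing text has been captured to a string and that each record names the device as "CPU n DIMM n", as in the examples above; the function name `disabled_dimms` and the pipe-separated sample layout are assumptions, not documented IPMItool behavior.

```python
import re

# Matches "Memory Device Disabled", an optional "|" separator, then "CPU n DIMM n".
DISABLED = re.compile(r"Memory Device Disabled\s*\|?\s*CPU (\d+) DIMM (\d+)")


def disabled_dimms(sel_text):
    """Return (cpu, dimm) pairs for 'Memory Device Disabled' SEL entries."""
    return [(int(cpu), int(dimm)) for cpu, dimm in DISABLED.findall(sel_text)]


sample = """\
100 | 08/26/2005 | 11:36:09 | OEM 0xfb
200 | 08/26/2005 | 11:36:12 | System Firmware Error | No usable system memory
300 | 08/26/2005 | 11:36:12 | Memory | Memory Device Disabled | CPU 0 DIMM 0
"""
print(disabled_dimms(sample))
```

Since multi-bit ECC errors are reported as Memory Device Disabled in this revision, the pairs returned are the DIMMs the BIOS has taken out of service.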
185. v02 53 C Copyright 1985 gt American Megatrends Inc OACh OAZh

Handling of Parity Errors (PERR)

This section lists facts and considerations about how the server handles parity errors (PERR).

* The handling of parity errors works through NMIs.
* During BIOS POST, the NMI is logged in the DMI and the SP SEL. See the following example command and output:

    [root@d-mpk12-53-238 root]# ipmitool -H 129.146.53.95 -U root -P changeme -I lan sel list -v
    SEL Record ID    : 0100
    Record Type      : 00
    Timestamp        : 01/10/2002 20:16:16
    Generator ID     : 0001
    EvM Revision     : 04
    Sensor Type      : Critical Interrupt
    Sensor Number    : 00
    Event Type       : Sensor-specific Discrete
    Event Direction  : Assertion Event
    Event Data       : 04ff00
    Description      : PCI PERR

* FIGURE B-4 shows an example of a DMI log screen from BIOS Setup Page, with a parity error.

FIGURE B-4 DMI Log Screen, PCI Parity Error (BIOS SETUP UTILITY, View Event Log). View Event Log entry: 09/12/05 14:27:47 PCI Parity

* The BIOS displays the following messages and freezes (during POST or DOS):
    * NMI EVENT
    * System Halted due to Fatal NMI
* The Linux NMI trap catches the interrupt and reports the following NMI "confusion report" sequence:

    Aug 5 05:15:00 d-mpk12-53-159 kernel: Uhhuh. NMI received for unknown reason 2d on CPU 0.
    Aug 5 05 15 00 d mpk
186. wn the server from main power mode to standby power mode:

* Graceful shutdown: Use a non-conducting ballpoint pen or stylus to press and release the Power button on the front panel. This causes Advanced Configuration and Power Interface (ACPI) enabled operating systems to perform an orderly shutdown of the operating system. Servers not running ACPI-enabled operating systems will shut down to standby power mode immediately.
* Emergency shutdown: Use a ballpoint pen or stylus to press and hold the Power button for four seconds to force main power off and enter standby power mode.

When main power is off, the Power OK LED on the front panel blinks once every three seconds, indicating that the server is in standby power mode. See FIGURE 1-2.

Caution: When you use the Power button to enter standby power mode, power is still directed to the graphics redirect and service processor (GRASP) board and power supply fans, indicated when the Power OK LED is blinking. To completely power off the server, disconnect the AC power cords from the back panel of the server.

FIGURE 1-2 Sun Fire X4500 Server Front Panel

Figure Legend
1  Locate button
2  Power OK LED
3  USB ports (2)

2. Remove the component covers, including hard disk drive cover, system controller cover, and fan cover, as required. For instructions on removing the component covers, re
187. yed

3. View the BIOS event log:
   a. From the BIOS Main Menu screen, select Advanced. The Advanced Settings screen is displayed:

       Main  Advanced  PCIPnP  Boot  Security  Chipset  Exit

       Advanced Settings                            Options for CPU
       WARNING: Setting wrong values in below sections may cause system to malfunction.

       CPU Configuration
       IDE Configuration
       SuperIO Configuration
       ACPI Configuration
       Event Log Configuration
       Hyper Transport Configuration
       IPMI 2.0 Configuration
       MPS Configuration
       PCI Express Configuration
       AMD PowerNow Configuration
       Remote Access Configuration
       USB Configuration

       Keys: Select Screen, Select Item, Enter Go to Sub Screen, F1 General Help, F10 Save and Exit, ESC Exit

   b. From the Advanced Settings screen, select Event Log Configuration. The Advanced Menu Event Logging Details screen is displayed:

       Advanced
       Event Logging details
       View all unread events
188. ylus or a straightened paper clip into the recess.

Blue: Ready to remove
Amber: Fault, service action required
Green: Operational, no action required
Serial management port: serial connection to service processor
Net management and service processor port
Gigabit Ethernet ports: connect server to Ethernet
Connect USB devices
Connect video monitor
Insert compact flash card devices

Internal Status Indicator LEDs

The Sun Fire X4540 server has internal status board LEDs for the CPU board, the CPU, and DIMM slots on the CPU board. The system includes internal LEDs on the disk drives, the fan trays, and the PCI slots. See the following figures and tables for information about the LEDs that you can view inside of the server.

* FIGURE 14-4 and FIGURE 14-5 show the disk drive and fan tray LEDs.
* FIGURE 14-6 and TABLE 14-3 describe the internal LED and button locations.

Disk Drive and Fan Tray LEDs

FIGURE 14-4 shows the location of the disk drive and fan trays. FIGURE 14-5 shows a close-up view of the disk drive and fan trays, and also shows the symbols that identify the LEDs.

FIGURE 14-4 Disk Drives and Fan Trays

FIGURE 14-5 Disk Drive and Fan Tray LEDs. LED labels: Ready to Remove; Fault, Service action