PRIMECLUSTER Reliant Monitor Services (RMS) with
Contents
1. [Figure 86: Command pop-up. Screenshot of the Cluster Admin pop-up menu (entries include View Composite Subapplication Graph, View logfile, Switch, Forced switch, Offline, and Clear fault) beside the attribute panel for app2.]

U42117-J-Z100-4-76, page 114 (Administration: Using Cluster Admin)

5.2.5.7 RMS graph customization

By default, the RMS graph does not display the resource object names on the graphs. The names are available as tool tips and can be seen by placing the mouse over a particular object. To add resource names, affiliation names, or both to the graphs, use the checkboxes on the Preferences menu. Figure 87 shows a graph that displays affiliation names.
2. Figures (end of list)
   ... 178
   Results of keyword-based search 179
   Results of severity-level-based search 181
   Controlling the log level with PCS 184

Tables
   Table 1   Available CLI commands 24
   Table 2   RMS base directory structure 29
   Table 3   Log directory structure 30
   Table 4   RMS host name conventions in /etc/hosts 35
   Table 5   Cluster site planning worksheet 59
   Table 6   Switch processing activities 166
   Table 7   Log files 172
   Table 8   Descriptions of severity levels 180
   Table 9   Log levels 182
   Table 10  Object types 323

Index
   >> input prompt 41
   A
   activating
      application 138
      configuration 44, 49, 77
      configuration, second time 88
   administrative privileges 93
   Affiliation attribute 335
   alternate interfaces 35, 73
   AlternateIp attribute 325
   AlternateIps 35, 73
   andOp
      attributes 323
      description 323
   application, switching to SysNode 23, 25
   application graph 110
   application logs
      displaying 105
      files 103
      searching text 107
   applications
      activating 138
      as objects 11
      dependencies 98
      displaying states 145
      going offline 157
      switching over 1
3. [Figure 80: Using the Find pop-up in log viewer. The screenshot shows the Find dialog over switchlog entries dated 2003-09-16, including SWT 39 NOTICE (processing a normal switch request), UAP 13 NOTICE (app2 and app1 AdminSwitch), US 22 NOTICE (app2 starting PreCheck), US 27 NOTICE (app2 PreCheck successful), and US 17 NOTICE (app2 starting Online processing).]

5.2.5 RMS graphs

Cluster Admin contains the following RMS graphs, which are useful for graphically viewing the details of the RMS configuration file:
• Full graph: Displays the complete cluster configuration.
• Application graph: Shows all of the resources used by an application and can be used to look at specific resource properties.
• Subapplication graph: Lists all of the subapplications used by a given application and shows the connections between the subapplications.
• Composite subapplications graph: Shows all the subapplications that the applicatio
4. Contents (excerpt)
   ... 187
   7.9.1   RMS Wizards detector logging 189
   7.9.2   Modifying levels while RMS is running 190
   7.10    PCS log files 191
   7.10.1  Manual Script Execution 191
   7.11    RMS troubleshooting 191

   Non-fatal error messages 195
      ADC: Admin configuration 196
      ADM: Admin, command, and detector queues 205
      BAS: Startup and configuration errors 227
      BM: Base monitor 232
      CML: Command line 243
      CRT: Contracts and contract jobs 245
      CTL: Controllers 246
      CUP: userApplication contracts 247
      DET: Detectors 248
      GEN: Generic detector 252
      INI: init script 253
      MIS: Miscellaneous 254
      NOD: Node detector 254
      QUE: Message queues 259
      SCR: Scripts 259
      SWT: Switch requests (hvswitch command) 260
      SYS: SysNode objects 262
      UAP: userApplication objects 266
      US: US files 270
      WLT: Wait list 271
      WRP: Wrappers 272

   Fatal error messages
5. Entry type                           RMS naming pattern    Examples
   Primary host name                    <hostname>RMS         fuji2RMS, fuji3RMS
   Alternate interfaces (AlternateIps)  <hostname>rmsAI<nn>   fuji2rmsAI01, fuji2rmsAI02
                                        where <nn> is a zero-filled sequence number in the range 01 to 99
   Table 4: RMS host name conventions in /etc/hosts

Note: The primary RMS host name for a machine must match the contents of the RELIANT_HOSTNAME variable in that machine's hvenv.local configuration file, if that file exists.

Example

The following entries are for a cluster with hosts fuji2 and fuji3, each of which has two alternate network interfaces:

   172.25.219.83   fuji2
   172.25.219.84   fuji3
   # host names for RMS
   192.168.1.1     fuji2RMS
   192.168.1.2     fuji3RMS
   192.168.1.11    fuji2rmsAI01   # alt for fuji2
   192.168.1.21    fuji2rmsAI02   # alt for fuji2
   192.168.1.12    fuji3rmsAI01   # alt for fuji3
   192.168.1.22    fuji3rmsAI02   # alt for fuji3

• .rhosts: Contains entries to control trusted login from remote hosts. The Wizard Tools require automatic login as root on every machine in the cluster, so the .rhosts file must be modified appropriately on each node. See the rhosts manual page for a complete description of the format.

Example

If the cluster consists of hosts fuji2 and fuji3, then every machine's .rhosts file should contain the following lines:

   fuji2 root
   f
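The naming patterns in Table 4 can be sketched in a few lines of Python. This is only an illustration, not part of the RMS tooling: the function name rms_host_names and its parameters are hypothetical, and the only facts taken from the text are the two patterns (<hostname>RMS and <hostname>rmsAI<nn>) and the 01 to 99 range for <nn>.

```python
def rms_host_names(hostname: str, alternate_count: int = 0) -> list[str]:
    """Return the RMS host names for one machine, following Table 4.

    hostname        -- the plain host name, e.g. "fuji2"
    alternate_count -- number of alternate interfaces (0 to 99)

    Hypothetical helper for illustration; RMS itself does not ship it.
    """
    if not 0 <= alternate_count <= 99:
        raise ValueError("alternate_count must be between 0 and 99")

    # Primary RMS host name: <hostname>RMS
    names = [f"{hostname}RMS"]

    # Alternate interfaces: <hostname>rmsAI<nn>, with <nn> zero-filled (01..99)
    names += [f"{hostname}rmsAI{n:02d}" for n in range(1, alternate_count + 1)]
    return names


if __name__ == "__main__":
    for host in ("fuji2", "fuji3"):
        print(host, rms_host_names(host, alternate_count=2))
```

The generated names still have to be paired with addresses in /etc/hosts by hand, as in the example above.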
6. [Figure 113: Displaying application states]

CLI

The syntax for the CLI is as follows:

   hvdisp {-a | -c} [-o out_file]

The -a option displays the resource_name, resource_type, and HostName attribute for each resource in the configuration. The -c option displays all information in compact format. The -o out_file option sends the output to a file called out_file.

The hvdisp command only works when RMS is running and does not require root privilege.

5.3.11 Viewing the switchlog

View the switchlog for a system node as follows:
> Right-click on the system node and select the View Switchlog option from the pop-up menu. For more details, refer to the RMS Troubleshooting Guide.

You may search the logs based on keywords, date/time ranges, severity levels, or exit codes using the log viewer.

CLI

You can view the switchlog file /var/opt/SMAWRrms/log/switchlog using a standard UNIX editor like vi. The RMS Troubleshooting Guide describes the RMS log files and their contents.

5.3.12 Viewing application logs

View the application logs as follows:
> Right-click on an application on the RMS tree and choose View logfile. For more details, refer to the RMS Troubleshooting Guide.

5.3.13 Viewing GUI message
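For readers working from the command line rather than the log viewer, a keyword or severity search over switchlog lines can be approximated with a short script. The sketch below is an assumption-laden illustration: the function search_log is hypothetical, the sample lines only imitate the general shape of switchlog entries, and the severity match is a crude whole-word test rather than a parse of the real log format.

```python
import re
from typing import Iterable, Iterator, Optional


def search_log(lines: Iterable[str],
               keyword: Optional[str] = None,
               severity: Optional[str] = None) -> Iterator[str]:
    """Yield log lines that match a keyword and/or a severity word.

    keyword  -- case-insensitive substring to look for (optional)
    severity -- severity word such as "NOTICE" or "ERROR", matched as a
                whole word anywhere in the line (optional, crude)
    """
    for line in lines:
        if keyword is not None and keyword.lower() not in line.lower():
            continue
        if severity is not None:
            # Split the line into alphabetic words and compare in upper case.
            words = {w.upper() for w in re.findall(r"[A-Za-z]+", line)}
            if severity.upper() not in words:
                continue
        yield line.rstrip("\n")


if __name__ == "__main__":
    # Sample lines are made up for the demonstration (assumed layout).
    sample = [
        "2003-09-16 08:21:41.466 US 22 NOTICE app2 starting PreCheck",
        "2003-09-16 08:21:42.890 US 27 NOTICE app2 PreCheck successful",
        "2003-09-16 08:21:43.000 SYS 1 ERROR sample entry",
    ]
    for hit in search_log(sample, keyword="PreCheck"):
        print(hit)
```

To search a real file, the same function can be fed an open file object, for example the switchlog path given above, provided the account has read permission.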
7. Contents (excerpt)
   ... 281
      ADC: Admin configuration 282
      ADM: Admin, command, and detector queues 282
      BM: Base monitor 283
      CML: Command line 285
      CMM: Communication 285
      CRT: Contracts and contract jobs 286
      DET: Detectors 286
      INI: init script 287
      MIS: Miscellaneous 289
      QUE: Message queues 289
      SCR: Scripts 290
      SYS: SysNode objects 292
      UAP: userApplication objects 293
      US: US files 293
      WLT: Wait list 294
      WRP: Wrappers 294

   10    Console error messages 295
   10.1  Console messages in alphabetical order 295
   11    Appendix: Operating system error numbers 321
   12    Appendix: Object types 323
   13    Appendix: Attributes 325
   13.1  Attributes available to the user 325
   13.2  Attributes managed by configuration wizards 335
   14    Appendix: Environment variables 341
   14.1  Global environment variables 341
   14.2  Local environment variables 345
   15    Appendix: List of manual pages 349
   15.1  COBR aaa
8. 10 Console error messages

This chapter contains a detailed list of all RMS error messages that appear on the console. The messages are listed here in alphabetical order; messages that begin with replaceable strings are listed first.

Most messages are accompanied by a description of the probable cause(s) and a suggested action to correct the problem. In some cases, the description or action is self-evident and no further information is necessary.

Some messages in the listings that follow contain words printed in italics. These words are placeholders for values, names, or strings that will be inserted in the actual message when the error occurs.

10.1 Console messages in alphabetical order

• command1: cannot get list of resources via <command2> from hvcm
  The wizards rely on hvmod for dynamic modification. If there is a problem executing command command2, this message is the result and hvmod exits with exit code 15.
  Action: Contact field support.

• command: failed due to errors in <argument>
  When hvmod has been invoked, it uses hvbuild internally. If there is a problem with the execution of hvbuild, this message is the result and hvmod is aborted. hvmod then exits with exit code 1.
  Action: Contact field support.

• command: bad state state
  If hvassert is performed for a state state that is not among the states that can be asserted, this message is the result and hvassert exits with exit code 1.
  Action: Make su
9. Figure 20: Remove hosts from a cluster menu

This menu lists all nodes currently in the cluster. Machines can be removed by selecting them individually or by selecting 4) ALL from the menu. In either case, machines being used by one or more applications cannot be removed.

4.4 Creating an application

After you have defined the set of hosts that form the cluster, you can configure an application that will run on those hosts. In this step, we will first create the application using the DEMO turnkey wizard. Begin at the Main configuration menu (Figure 21).

   fuji2: Main configuration menu, current configuration: mydemo
   No RMS active in the cluster
    1) HELP                        10) Configuration-Remove
    2) QUIT                        11) Configuration-Freeze
    3) Application-Create          12) Configuration-Thaw
    4) Application-Edit            13) Configuration-Edit-Global-Settings
    5) Application-Remove          14) Configuration-Consistency-Report
    6) Application-Clone           15) Configuration-ScriptExecution
    7) Configuration-Generate      16) RMS-CreateMachine
    8) Configuration-Activate      17) RMS-RemoveMachine
    9) Configuration-Copy
   Choose an action: 3

   Figure 21: Main configuration menu

> Select Application-Create by entering the number 3.

The Application type selection menu appears (Figure 22).

   Creation: Application type selection menu:
    1) HELP   2) QUIT   3) RETURN   4) OPTIONS   5) DEMO   6) GENERIC   7) LIVECACHE   8) R3AN
10. [Figure 93: Exclamation marks in clusterwide table and the RMS tree]

5.2.6.1 Command pop-ups

Use the context-sensitive command pop-up menus to perform some of the operations on the clusterwide table nodes. Invoke the pop-up menu by right-clicking on an object. The menu options are based on the type and the current state of the selected node (Figure 94).

[Figure 94: Command pop-ups in clusterwide table]

5.2.7 Changing the RMS configuration

When you stop and restart RMS with a different configuration, the graphs, the clusterwide table, and the RMS tree are redraw
11. [Figure 111: Local environmental variables pop-up]

Displaying the local environment variables displays the clusterwide environment variables as well (Figure 112).

[Figure 112: Cluster Admin screenshot listing environment variables; legible entries include SCRIPTS_TIME_OUT 300, RELIANT_LOG_LIFE 7, RELIANT_SHUT_MIN_WAIT 150, HV_AUTOSTART_WAIT 60, and HV_CHECKSUM_INTERVAL 120.]
12. 9.7 DET: Detectors

• DET 8 Failed to create DET_REP_Q
  If RMS is unable to create the UNIX message queue DET_REP_Q for communication between a detector and itself, this message is the result and RMS exits with exit code 12.
  Action: Contact field support.

• DET 9 Message send failed in detector request Q queue
  During hvlogclean, the detector request queue queue is used for sending information to the detector from the base monitor. If there is a problem in communication, this message is the result and RMS exits with exit code 12.
  Action: Contact field support.

• DET 16 Cannot create gdet queue of kind gkind
  Each of the generic detectors has a message queue that it uses to communicate with the base monitor. If there is a problem creating a queue for a detector of kind kind, this message is the result and RMS exits with exit code 12.
  Action: Contact field support.

• DET 18 Error reading hvgdstartup file. Error message errorreason
  When the RMS base monitor tries starting up the generic detectors, it parses the hvgdstartup file for detector information. If RMS encounters an error while reading this file, it prints this message along with the reason errorreason for the failure. RMS then exits with exit code 26.
  Action: Contact field support.

9.8 INI: init script

• INI 4 InitScript does not have execute permission
  InitScrip
13. The attribute <attribute> is constant and can only be set in a configuration file.
  Action: Make sure that there is no attempt to modify <attribute> within <object>.

• ADM 77 58 Dynamic modification failed: cannot delete object object since its state is currently being asserted
  This message can appear in the switchlog if dynamic modification is being performed on an object that is being asserted.
  Action: Perform the modification after the assertion has been fulfilled.

• ADM 78 59 Dynamic modification failed: PriorityList <prioritylist> does not include all the hosts where the application <appname> may become Online. Make sure that PriorityList contains all hosts from the HostName attribute of the application's children
  Set PriorityList for <appname> to include all the host names from the HostName attribute of the application's children.
  Action: No duplicate host names should be present in the PriorityList.

• ADM 79 60 Dynamic modification failed: PriorityList <prioritylist> includes hosts where the application <appname> cannot go Online. Make sure PriorityList contains only hosts from the HostName attributes of the application's children
  The HostName attribute of one or more of the children specifies hosts that are not in the parent's PriorityList attribute
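The three PriorityList conditions described in ADM 78 and ADM 79 (every child host must be listed, no host may be listed that no child names, and no host may appear twice) amount to simple set comparisons. The following Python sketch illustrates that reasoning only; check_priority_list is a hypothetical helper, and it does not reproduce how RMS itself validates a configuration.

```python
def check_priority_list(priority_list: list[str],
                        child_host_names: list[str]) -> dict[str, list[str]]:
    """Compare an application's PriorityList with its children's HostName values.

    Returns three sorted lists:
      missing    -- hosts named by children but absent from PriorityList
                    (the ADM 78 situation)
      extra      -- hosts in PriorityList that no child names
                    (the ADM 79 situation)
      duplicates -- hosts that appear more than once in PriorityList
    """
    required = set(child_host_names)
    listed = set(priority_list)
    duplicates = {h for h in priority_list if priority_list.count(h) > 1}
    return {
        "missing": sorted(required - listed),
        "extra": sorted(listed - required),
        "duplicates": sorted(duplicates),
    }


if __name__ == "__main__":
    result = check_priority_list(
        priority_list=["fuji2RMS", "fuji2RMS"],
        child_host_names=["fuji2RMS", "fuji3RMS"],
    )
    print(result)
```

A PriorityList passes this check when all three lists come back empty.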
14. Make sure that RMS is running on the other hosts in the cluster, and check whether there are any network issues.

• BM 13 S4 no symbol for object <object> in inp file, line linenumber
  RMS internal error.
  Action: Contact field support.

• BM 14 S6 local queue is empty on read directive in line linenumber
  RMS internal error.
  Action: Contact field support.

• BM 15 S2 destination object <object> is absent in line linenumber
  RMS internal error.
  Action: Contact field support.

• BM 16 S2 sender object <object> is absent in line linenumber
  RMS internal error.
  Action: Contact field support.

• BM 17 53 Dynamic modification failed: line linenumber, cannot build an object of unknown type <symbol>
  An object of unknown type is added during dynamic modification.
  Action: Use only objects of known types in configuration files.

• BM 18 54 Dynamic modification failed: line linenumber, cannot set value for attribute <attribute> since object <object> does not exist
  An attribute of a non-existing object cannot be modified.
  Action: Modify attributes only for existing objects.

• BM 19 39 Dynamic modification failed: line linenumber, cannot modify attribute <attribute> of object <object> with value <value>
15. Contact field support.

• SWT 25 objectname: outstanding switch request of dead host was denied; cluster may be in an inconsistent condition
  A host died during the processing of a switch request. The host that takes over the responsibility for that particular userApplication tried to proceed with the partly done switch request, but another host does not agree. This indicates a severe cluster inconsistency and a critical internal error.
  Action: Contact field support.

• SWT 26 object: dead host <hostname> was holding an unknown lock. Lock will be skipped
  This message appears when the dead host <hostname> was holding a lock that is unknown to the new responsible host.
  Action: Allow time for the cluster to clean up.

• SWT 45 hvshut aborted because of a busy uap <appname>
  The hvshut request was aborted because the application is busy.
  Action: Do not shut down RMS when its applications are busy. Make sure the application finishes its processing before shutting down RMS.

• SWT 46 hvshut aborted because modification is in progress
  The hvshut request was aborted because dynamic modification is in progress.
  Action: Do not shut down RMS while dynamic modification is in progress. Wait until dynamic modification finishes before shutting down RMS.

8.17 SYS: SysNode objects

• SYS 1 Error on SysNode object It failed
16. Represents a machine that is running as a node in a cluster.
• gResource: Represents a generic resource that is to be defined according to the needs of a customer application.

In a typical configuration, one detector can be associated with all objects of the same type.

3.7 Further reading

The preceding sections were intended to make the reader familiar with some basic concepts and methods of the RMS Wizards. More information may be obtained from a number of documents that provide further reading on these tools and the way they are used.

RMS Wizards documentation package

The RMS Wizards documentation package is available in HTML format on the PRIMECLUSTER CD-ROM. The information is presented in separate directories covering the following major topics:
• Primer: Provides an introduction to the RMS Wizards, covering many features in more detail than is possible in this chapter.
• Wizards: Provides information on individual wizards of all three kinds described in this chapter. Covers turnkey wizards, resource wizards, and other wizards, including the generic wizard.
• Scripts and tools: Provides information on some scripts and tools that may be useful in setting up a high-availability configuration by means of the RMS Wizards. Includes the gresources sub-section, which contains descriptions of a number of detectors. Gresources are defined as ph
17. See also local node, node.

reporting message (RMS)
   A message that a detector uses to report the state of a particular resource to the base monitor.

resource (RMS)
   A hardware or software element (private or shared) that provides a function, such as a mirrored disk, mirrored disk pieces, or a database server. A local resource is monitored only by the local node.
   See also private resource (RMS), shared resource.

resource definition (RMS)
   See object definition (RMS).

resource label (RMS)
   The name of the resource as displayed in a system graph.

resource state (RMS)
   Current state of a resource.

RMS
   See Reliant Monitor Services (RMS).

RMS commands
   Commands that enable RMS resources to be administered from the command line.

RMS configuration
   A configuration made up of two or more nodes connected to shared resources. Each node has its own copy of operating system and RMS software, as well as its own applications.

RMS Wizard Kit
   RMS configuration products that have been designed for specific applications. Each component of the Wizard Kit includes customized default settings, subapplications, detectors, and scripts. These application wizards also tailor the Wizard Tools or PCS interface to provide controls for the additional features.
   See also RMS Wizard Tools, Reliant Monitor Services (RMS).

RMS Wizard Tools
   A software package composed of various configuration and administrati
18. Trying to link a child <childobject> that is online to a parent object that is supposed to go offline is not allowed, and dynamic modification aborts.
  Action: Make sure that the parent and the child are in a similar state.

• ADM 43 10 Dynamic modification failed: linking the same resource <childobject> to different applications <userapplication1> and <userapplication2>
  When RMS gets a directive to add a new child object <childobject> having as parent and child resources belonging to different applications <userapplication1> and <userapplication2>, the above message is printed and dynamic modification aborts.
  Action: When adding a new resource, make sure that it does not have as its parent and children resources belonging to different applications.

• ADM 44 11 Dynamic modification failed: object <object> does not have an existing parent
  Any attempt to create an object <object> that does not have an existing parent leads to this message, and dynamic modification aborts.
  Action: Make sure that the object <object> has an existing object as its parent.

• ADM 45 55 Dynamic modification failed: HostName is absent or invalid for resource <object>
  If the HostName attribute of object <object> is an invalid value, then this message occurs and dynamic mod
19. and due to the previous error, it will remain online, but neither automatic nor manual switchover will be possible on this host until <detector> detector will report offline or faulted
  The checksums of the configurations of the local and the remote host are different, no more than the number of seconds determined by the value of the environment variable HV_CHECKSUM_INTERVAL have passed, and not all of the applications are offline or faulted. RMS will continue to remain online, but neither automatic nor manual switchover will be possible on this host until the detector detector reports offline or faulted.
  Action: Make sure that both the local and the remote host are running the same configuration.

• ADC 3 Remote host <hostname> reported the checksum remotechecksum, which is different from the local checksum localchecksum
  If the checksum of the configuration file reported by the remote host <hostname> is different from the checksum of the configuration file on the local host, this message will appear.
  Action: The most likely cause is that the local host and the remote host are running configuration files that differ. Make sure that the local host and the remote host are running the same configuration file.

• ADC 4 Host <hostname> is not in the local configuration
  This message is a result of the followin
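The ADC 3 check boils down to comparing a checksum of the local configuration with the one a remote host reports. The text does not say which checksum RMS computes, so the sketch below uses SHA-256 purely as a stand-in; config_checksum and same_configuration are hypothetical names, and the byte strings stand for the contents of two configuration files.

```python
import hashlib


def config_checksum(config_bytes: bytes) -> str:
    """Return a hex digest of configuration contents.

    SHA-256 is an assumption made for this illustration; the checksum
    that RMS actually uses is not specified in the text.
    """
    return hashlib.sha256(config_bytes).hexdigest()


def same_configuration(local_config: bytes, remote_config: bytes) -> bool:
    """True when both hosts would report the same checksum."""
    return config_checksum(local_config) == config_checksum(remote_config)


if __name__ == "__main__":
    local = b"application app2 on fuji2RMS fuji3RMS\n"
    remote = b"application app2 on fuji2RMS\n"
    if not same_configuration(local, remote):
        print("checksums differ: the hosts are not running the same configuration")
```

In practice, the corrective action from the message text still applies: distribute and activate the same configuration on both hosts rather than comparing files by hand.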
20. ityList attributes. Each child application must be able to run on the same set of nodes as its parent application; the controller keeps track of and sends requests to its child applications that are running on the same node.

  Other attributes of the controller object must be set as follows (these are set automatically by the configuration tools):
     IgnoreOfflineRequest 0
     IgnoreOnlineRequest 0
     Scalable 0

  If Follow is set to 1, Scalable must be set to 0. Follow and Scalable control policies are mutually exclusive.

• Halt
  Possible values: 0, 1
  Default: 0
  Valid for userApplication objects. Eliminates a node if a double fault occurs.

• I_List
  Possible values: Space-separated list of SysNode names
  Default: empty
  Valid for all SysNode objects. List of additional cluster interconnects that should be monitored by RMS. These interconnects are used only by customer applications and not by any PRIMECLUSTER products. All monitored interconnects must be found in the /etc/hosts database. In addition, all SysNode objects must have the same number of additional interconnects.

• MaxControllers
  Possible values: 0 to 512
  Default: 512
  Valid for userApplication objects. Upper limit of parent userApplication objects for the specified child application.

• MonitorOnly
  Possible values: 0, 1
  Default: 0
  Valid for resource objects. If set to 1
21. on page 152

7 Troubleshooting

This chapter discusses some PRIMECLUSTER facilities for debugging the RMS product from both the command line interface (CLI) and the Cluster Admin graphical user interface (GUI). It provides details on log files, their location, how to turn on logging levels, how to view logs from the GUI, and how to view log files from the CLI.

This chapter discusses the following:
• The section "Overview" on page 169 summarizes the troubleshooting process.
• The section "Debug and error messages" on page 171 describes RMS debug and error messages.
• The section "Log files" on page 172 identifies and explains the RMS log files.
• The section "Using the log viewer" on page 174 explains the log viewer facilities.
• The section "Specifying the log level" on page 182 specifies and explains the log levels.
• The section "Interpreting log files" on page 185 explains the meaning of the data in the log files.
• The section "System log" on page 186 describes the system log.
• The section "Wizard log files" on page 187 details the RMS Wizard log files.
• The section "PCS log files" on page 191 lists the locations of the PCS log files.
• The section "RMS troubleshooting" on page 191 supplies solutions to problems that could occur while using RMS.

7.1 Overview
22. process <process> <name> to host <host>
  RMS exchanges messages between processes and hosts to maintain inter-host communication. If the delivery of a message has failed, then this error is the result. This can occur if one or more hosts in the cluster are not active or if there is a problem with the network.
  Action:
  i.  Check the other hosts in the cluster. If any are not alive, check the switchlog for information regarding why RMS has died on those hosts. Perform the following steps in order:
      a. hvdisp -a
      b. In the output of step a, check whether the state of any of the resources whose type is SysNode is offline. If so, RMS is not running on that node.
      c. Check the switchlogs of all the nodes that are offline to determine the reason why RMS on that node is not active.
  ii. If the other hosts that are part of the cluster are alive, then there is some problem with the network.

• WRP 12 Failed to bind port to socket
  This could occur if RMS is unable to bind the endpoint for communication.
  Action: Contact field support.

• WRP 14 No available slot to create a new host instance
  When the base monitor for RMS starts up, it creates a slot in an internal data structure for every host in the cluster. When hvdet_node is started up, RMS sends it a list of the SysNode objects that are put into different slots in the in
23. 1
  Default: 1 (start RMS in the rc script)
  Determines if RMS is started in the rc script. Prerequisite for rc start: CONFIG.rms exists and contains a valid entry.

• HV_REALTIMEPRIORITY
  Possible values: 0 to 99
  Default: 50
  Defines the real-time priority for the RMS base monitor and its detectors. Caution should be used when adjusting this variable. High settings can prevent other OS real-time processes from getting their processor time slice. Low settings can prevent the RMS base monitor from reacting to detector reports and from performing requests from command line utilities.
  By default, the base monitor and detectors are real-time processes. However, if the base monitor has been started with the -R (non-real-time) flag, the value of HV_REALTIMEPRIORITY is disregarded.

• HV_SCRIPTS_DEBUG
  Possible values: 0, 1
  Default: 0
  Controls debugging output from RMS scripts. If this variable is set to 1, each script writes detailed information about the commands that are executed to the RMS switchlog file. The type of information logged may vary according to the script. This setting applies only to those scripts provided with PRIMECLUSTER products. To disable script debug message logging, delete the HV_SCRIPTS_DEBUG entry or set HV_SCRIPTS_DEBUG=0 in hvenv.local.

• HV_SYSLOG_USE
  Possible values: 0, 1
  Default: 1 (in hvenv)
  Controls outp
24. 12.
  Action: System problem. Contact field support.

• SCR 5 REQUIRED PROCESS RESTART FAILED: Unable to restart detector. Shutting down RMS
  If the detector detector could not be restarted, this message is the result, with RMS shutting down with exit code 14. The restart could have failed for any of the following reasons:
  – If the detector needs to be restarted more than 3 times in one minute.
  – If there is a problem with memory allocation within RMS.
  Action: Contact field support.

• SCR 10 InitScript did not run ok. RMS is being shut down
  RMS runs the InitScript initially. The value of InitScript is the value of the environment variable RELIANT_INITSCRIPT in hvenv. If this InitScript fails for some reason (for example, by exiting with a non-zero code or by getting a signal), then this message is printed and RMS shuts down with exit code 56.
  Action: Contact field support.

• SCR 12 incorrect initialization of RealDetReport. Shutting down RMS
  Since the scripts are executed based on the reports of the detectors, if the detector reports a state other than Online, Offline, Faulted, Standby, or NoReport, this message is the result, with RMS exiting with exit code 8.
  Action: Make sure that the detector only reports the states Online, Offline, Faulted, Standby, or NoReport.

• SCR 13 ExecScript: Failed to exec script <script> for object <nodename> err
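The first failure reason for SCR 5 (more than 3 restarts within one minute) is a sliding-window rate limit. The sketch below shows one way such a limit can be expressed; it is not RMS code. The class name RestartLimiter, its parameters, and the choice to leave refused restarts uncounted are all assumptions made for illustration; only the numbers 3 and 60 seconds come from the text.

```python
from collections import deque


class RestartLimiter:
    """Allow at most max_restarts restarts within any window_seconds span."""

    def __init__(self, max_restarts: int = 3, window_seconds: float = 60.0):
        self.max_restarts = max_restarts
        self.window_seconds = window_seconds
        self._times: deque[float] = deque()  # timestamps of granted restarts

    def allow_restart(self, now: float) -> bool:
        """Return True and record the restart if it stays within the limit."""
        # Forget restarts that have fallen out of the sliding window.
        while self._times and now - self._times[0] >= self.window_seconds:
            self._times.popleft()
        if len(self._times) >= self.max_restarts:
            return False  # this would be restart number max_restarts + 1
        self._times.append(now)
        return True


if __name__ == "__main__":
    limiter = RestartLimiter()
    for t in (0.0, 10.0, 20.0, 30.0):
        print(f"t={t:>4}: restart allowed = {limiter.allow_restart(t)}")
```

With the defaults, three restarts inside one minute are granted and a fourth is refused, which corresponds to the point at which the message text says RMS gives up and shuts down.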
25. 3

Figure 46: Starting again with the Main configuration menu

You can add more machines to the cluster at this point, provided the required site preparation steps have been completed.

> To add machines, select RMS-CreateMachine by entering the number 15. Follow the procedure described earlier, and then return to the Main configuration menu when finished.

From the Main configuration menu, select Application-Create as follows:

> Select Application-Create by entering the number 3.

The Application type selection menu opens (see Figure 47).

   Creation: Application type selection menu:
    1) HELP   2) QUIT   3) RETURN   4) OPTIONS   5) DEMO   6) GENERIC   7) LIVECACHE   8) R3ANY   9) R3CI   10) RTP
   Application Type: 6

   Figure 47: Application type selection menu

This time, assign the GENERIC application type to the application. This means that the GENERIC turnkey wizard will be in charge of the configuration procedure.

> Select the GENERIC application type by entering the number 6.

After the consistency check, you are prompted to configure the basic settings. APP2 is the default value for the application name. If you want to change the name, select 5) ApplicationName (see Figure 48).

   Consistency check ...
    1) HELP   2) NO-SAVE+EXIT   3
26. Action: Correct the erroneous host condition so that the InitScript can run without stopping, or fix the InitScript itself.
- INI 14: InitScript has been abnormally terminated.
  InitScript has been abnormally terminated.
  Action: Correct the erroneous host condition so that the InitScript can run without stopping, or fix the InitScript itself.
9.9 MIS: Miscellaneous
- MIS 4: The locks directory <directory> cannot be cleaned of all old locks files at <call>, errno <errnonumber>, error <errortext>.
  RMS commands such as hvdisp, hvswitch, hvutil, and hvdump use lock files in the directory <directory> for signal handling. These files are deleted after the commands complete. The locks directory is also cleaned when RMS starts up. If the files cannot be cleaned for some reason, this message is printed and RMS exits with exit code 99. <call> indicates the stage at which the cleanup failed, <errnonumber> is the OS errno value, and <errortext> is the OS-supplied explanation for the errno.
  Action: Make sure that the locks directory <directory> exists.
9.10 QUE: Message queues
- QUE 1: Error status in ADMIN_Q.
  Different utilities use the ADMIN_Q to communicate with the base monitor. If there is an error with this queue, this message is printed and RMS exits with exit code 3.
  Action: Contact field support.
- QUE 2: Read message failed in ADMIN_Q. Thi
27. Failed to load a detector of kind <kind>.
  The RMS base monitor was unable to start a detector.
  Action: Make sure the detector executable is present in the right place and has execute permission.
- BAS 32: ERROR IN CONFIGURATION FILE: Object <object> has no detector while all its children's <MonitorOnly> attributes are set to 1.
  An object without a detector has all its children's MonitorOnly attributes set to 1. An object without a detector must have at least one child for which MonitorOnly is set to 0.
  Action: Change the configuration so that each object without a detector has at least one child with MonitorOnly set to 0.
- BAS 36: ERROR IN CONFIGURATION FILE: The object <object> has both attributes MonitorOnly and ClusterExclusive set. These attributes are incompatible; only one of them may be used.
  Both MonitorOnly and ClusterExclusive are set for the same RMS object. Only one of them can be set for a given object.
  Action: Eliminate one or both settings from the RMS object <object>.
8.4 BM: Base monitor
- BM 8: Failed sending message <message> to object <object> on host <host>.
  RMS prints this message when it encounters a problem transmitting the message <message> to another host in the cluster. RMS on the other host may be down, or there may be a network problem.
  Action:
28. [Figure 101: Stopping RMS]
2. Select the radio button for all available nodes and click Ok to shut down RMS on all nodes (Figure 102).
[Figure 102: Stopping RMS on all available nodes. Notes shown in the RMS Shutdown dialog: This dialog will allow you to stop RMS on remote nodes. Three options are available if you try to stop RMS on one node: 1. Stop all UAPs 2. Keep local UAPs 3. Forced shutdown (UAP: user application). Selection of 2 or 3 may break the consistency of the cluster.]
3. To shut down RMS on specific nodes, select the radio button for one node from the list and then click the checkboxes of the nodes you want to shut down (Figure 103). Each node has a dropdown list in the Options column to provide additional control:
- Stop all UAPs: Stops all user applications for the selected node.
- Keep local UAPs: Leaves the applications running
29. See also UP (CF), LEFTCLUSTER (CF), node state (CF).
ENS (CF): See Event Notification Services (CF).
environment variables: Variables or parameters that are defined globally.
error detection (RMS): The process of detecting an error. For RMS, this includes initiating a log entry, sending a message to a log file, or making an appropriate recovery response.
Event Notification Services (CF): This PRIMECLUSTER module provides an atomic broadcast facility for events.
failover (RMS, SIS): With SIS, this process switches a failed node to a backup node. With RMS, this process is known as switchover. See also automatic switchover (RMS), directed switchover (RMS), switchover (RMS), symmetrical switchover (RMS).
gateway node (SIS): Gateway nodes have an external network interface. All incoming packets are received by this node and forwarded to the selected service node, depending on the scheduling algorithm for the service. See also service node (SIS), database node (SIS), Scalable Internet Services (SIS).
GDS: See Global Disk Services.
generating a configuration (RMS): The process of creating a single configuration file that can be distributed to all nodes affected by the configuration. This is normally done automatically when the configuration is activated using PCS, the RMS Wizards, or the CLI. See also activating a configuration (RMS), distributing a configuration (RMS).
GFS: See Global Fi
30. Table 6: Switch processing activities
During switch processing, RMS notifies all hosts in the cluster of the procedure. This prevents competing requests.
6.7.2 Extreme situations during switch processing
In rare cases, fatal fault situations of varying severity can occur during switch processing. For example, the relevant host can crash, or communication between the hosts can temporarily fail. RMS resolves these situations through a complex scenario based on timeout handling, recovery measures, and recalculation. These measures are carried out transparently to the user.
It is important to realize that, under extreme circumstances, inconsistencies that RMS cannot resolve can occur in the cluster. To minimize the damage in the case of data resources that are available for parallel access, RMS blocks any further requests by entering a cluster-wide loop state. When the cause of the problem has been identified and cleared, the system administrator must stop and restart RMS for the entire cluster. This guarantees consistency by reinitializing the internal RMS states.
Stopping and restarting RMS does not mean that all applications under the control of RMS have to be stopped. RMS provides a command, hvshut -L, that enables system administrators to stop RMS without performing offline processing for the applications. The system administrator can then restart RMS while the applications are running (see also the section Online processing
31. The RMS troubleshooting process usually begins after you observe an error condition or state change in Cluster Admin in one of the following areas:
- Clusterwide table
- RMS tree
- Graph
The clusterwide table contains summary information and is a good place to start looking for error conditions. For additional details, you can look at the RMS tree or the graph. Depending on whether you need to look at the switchlogs or application logs, you can then use the log viewer facility to view the log files.
The log viewer has search facilities based on the following:
- Keywords
- Severity
- Non-zero exit codes
Search for causes of errors using the keywords and the date range fields. For emergency, alert, and critical conditions, you can search based on severity. For proactive troubleshooting, you can search based on severity for the error, warning, notice, and info severity codes.
Note: It is recommended that you periodically use the log viewer to check the log files by severity level to avoid serious problems. If you cannot diagnose the cause of a problem, look at the log viewer from two or more nodes in the cluster. Refer to the section "RMS troubleshooting" on page 191 for an explanation of corrective actions.
Resolve error conditions as follows:
1. Use the Cluster Admin GUI.
2. View the log files if needed.
3. Change log levels to get more
32. Using the Wizard Tools interface; Configuration example; Administration; Advanced RMS concepts; Troubleshooting; Non-fatal error messages; Fatal error messages; Console error messages
(Continued) Appendix: Operating system error numbers; Appendix: Object types; Appendix: Attributes; Appendix: Environment variables; Appendix: List of manual pages; Glossary; Abbreviations; Figures; Tables; Index
Contents
Preface
- About this Manual
- Related documentation
- Conventions
- Notation
- Prompts
- Manual page section numbers
- The keyboard
- Typefaces
- Example 1
- Example 2
- Command syntax
- Important notes and cautions
Introduction
- PRIMECLUSTER overview
- How RMS provides high availability
- Applications, resources, and objects
- Node and application failover
- Controlled applications and controller objects
- Follow controllers
- Scalable controllers
33. cmp_Prio <list>.
  This message is printed when the priority list <list> has invalid entries.
  Action: Contact field support.
- UAP 6: Could not add new entry to priority list.
  Critical internal error.
  Action: Contact field support.
- UAP 7: Could not remove entries from priority list.
  Critical internal error.
  Action: Contact field support.
- UAP 8: <object>: cpy_Prio failed, source list corrupted.
  This message appears when either the PriorityList is empty or the list is corrupted. Critical internal error.
  Action: Contact field support.
- UAP 9: <object>: Update of PriorityList failed, cluster may be in inconsistent condition.
  This message is printed if a contract that is supposed to be present in the internal list does not exist. The cluster may be in an inconsistent condition.
  Action: Contact field support.
- UAP 15: <sysnode>: PrepareStandAloneContract() processing unknown contract.
  This message appears when only one application <sysnode> is Online and it has to process a contract that is not supported. Critical internal error.
  Action: Contact field support.
- UAP 16: <object>: SendUAppLockContract: local host doesn't hold a lock. Contract processing denied.
  This message appears when the contract is processed by the local host, which does not hold the lock for that application contract. Critical inte
34. - The detector does not detect the node as online within a specific period after the OnlineScript completes.
- A child of an AND node indicates a Fault.
- All children of an OR node signal Fault.
6.4.5 userApplication is already online
A situation can occur in which the entire logical graph of a userApplication is already online when RMS is initialized. In this case, the PreCheckScript does not execute, and the affected nodes switch directly from the Unknown state to the Online state without executing any scripts.
Request while online
If a userApplication receives an online request when it is already online, the request is forwarded to the other nodes as usual. The only difference from the section "Online processing" on page 152 is that any nodes that are already online forward the request or the responses without executing their scripts and without changing to the Wait state.
A typical example of a node that is always online when RMS is initialized is a node for a physical disk (node type disk), since physical disks cannot be deconfigured. Due to the property of the PXRE, the physical disk can be deconfigured on Solaris.
No request while online
If a userApplication does not receive an online request when it is already online and RMS is initialized, no explicit online processing is carried out in the logical graph. The userApplication, however, notifies its Online state to the RMS monitors on the other hosts in the cluster to ensure t
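The AND/OR fault conditions listed at the top of this excerpt can be expressed as a short predicate. This is a minimal sketch under stated assumptions: the function name and the boolean representation of child states are hypothetical, and only the two rules themselves (any faulted child faults an AND node, all faulted children fault an OR node) come from the text.

```python
def node_is_faulted(operator, child_faulted):
    """Apply the fault rules quoted in the excerpt.

    operator      -- "AND" or "OR" (hypothetical labels for the node kinds)
    child_faulted -- iterable of booleans, True if that child signals Fault
    """
    children = list(child_faulted)
    if operator == "AND":
        # A single faulted child is enough to fault an AND node.
        return any(children)
    if operator == "OR":
        # An OR node is faulted only when every child signals Fault.
        return len(children) > 0 and all(children)
    raise ValueError(f"unknown operator: {operator!r}")
```

Treating an OR node with no children as not faulted is a choice made for this sketch; the excerpt does not say how that case is handled.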
35. each detector periodically sends a heartbeat message to the base monitor. When the heartbeat is missing for a period of time, the base monitor prints this message to the switchlog. The base monitor then sends an alarm signal to the stalled process to ensure that the detector properly handles its main loop responsibilities. If the time stated since the base monitor last received the heartbeat from the detector exceeds 300 seconds, the message may indicate that the base monitor is not being allowed to run. Currently, the base monitor is a real-time process but is not locked in memory. This message may also occur because the bm process has been swapped out and has not had a chance to run again.
  Action: Make sure that the base monitor and detector are active, using system tools such as truss(1) or strace(1). If the loss of heartbeat greatly exceeds the 300-second timeout, this may indicate that system swap or main memory is insufficient.
8.10 GEN: Generic detector
- GEN 1: Usage: <command> -t time_interval -k <kind> -d
  Usage error for <command>.
  Action: Use the specified syntax for the command.
- GEN 2: Memory lock failed.
  Action: Critical error. Contact field support.
- GEN 3: Cannot open <command> log file.
  The file <command> log used for logging could not be opened.
  Action: Contact field support.
- GEN 4: failed to cre
36. <explanation>
  Action: Correct the problem according to <explanation>.
8.12 MIS: Miscellaneous
- MIS 1: No space for object.
  Action: Critical error. Contact field support.
8.13 NOD: Node detector
- NOD 6: Usage: <detector> -t time_interval
  If the detector <detector> has been given a non-integer argument, this message is printed and the detector exits with exit code 103.
  Action: Provide an integer as the <time_interval> for the detector <detector>.
- NOD 7: cluster host <host> is no longer in time sync with local node. Sane operation of RMS can no longer be guaranteed. Further out-of-sync messages will appear in the syslog.
  The time on <host> is not in sync with the time on the local node.
  Action: Synchronize the time on <host> with the time on the local node.
- NOD 8: Usage: <detector> -t time_interval -d -n
  If the argument -t <time_interval> has not been provided for the detector <detector>, or if an argument other than -d or -n is used, this message is printed to the switchlog and the detector exits with exit code 103.
  Action: Use the specified syntax when invoking the detector.
- NOD 9: <detector>: Failed to open req_queue.
  The detector hvdet_node uses the queue req_queue to get jobs from the base monitor. If there is some problem with the queue, this message
37. messages written to the console, along with their causes and resolutions.
- The chapter "Appendix: Operating system error numbers" on page 321 lists operating system error numbers for Solaris and Linux.
- The chapter "Appendix: Object types" on page 323 lists all the object types that are supplied with RMS.
- The chapter "Appendix: Attributes" on page 325 lists the attributes that are required for each object type.
- The chapter "Appendix: Environment variables" on page 341 describes the RMS environment variables.
- The chapter "Appendix: List of manual pages" on page 349 lists the manual pages for PRIMECLUSTER.
1.2 Related documentation
The documentation listed in this section contains information relevant to PRIMECLUSTER and can be ordered through your sales representative.
- Concepts Guide (Solaris, Linux): Provides conceptual details on the PRIMECLUSTER family of products.
- Installation Guide (Solaris): Provides instructions for installing and upgrading PRIMECLUSTER products.
- Installation Guide (Linux): Provides instructions for installing and upgrading PRIMECLUSTER products.
- Web-Based Admin View (Solaris) Operation Guide: Provides information on using the Web-Based Admin View management GUI.
- Web-Based Admin View (Linux) Operation Guide: Provides information on using the Web-Based Admin View management GUI.
- Cluster Foundation (CF) (Solaris) Configuration and Administration Guide: Provides instru
38. object <object> is of type "and", its state is online, but not all children are online.
  This message may appear during dynamic modification, when the existing configuration is checked before the modification is applied. If this message appears, the dynamic modification will not proceed.
  Action: Make sure that online objects of type "and" have all their children in the online state; only then apply the dynamic modification.
- BAS 29: ERROR IN CONFIGURATION FILE: object <object> cannot have its HostName attribute set since it is not a child of any userApplication.
  An object that is not a child of a userApplication has its HostName attribute set. Only children of a userApplication object can, and must, have the HostName attribute set.
  Action: Eliminate the HostName attribute from the definition of the object, or disconnect the userApplication object from this object, making this object a child of another non-userApplication object.
- BAS 30: ERROR IN CONFIGURATION FILE: The object <object> has both attributes LieOffline and ClusterExclusive set. These attributes are incompatible; only one of them may be used.
  Both LieOffline and ClusterExclusive are set for the same RMS object. Only one of them can be set for a given object.
  Action: Eliminate one or both settings from the RMS object <object>.
- BAS 31: ERROR IN CONFIGURATION FILE:
39. online request: hvswitch -f <userApplication> <target_host>
The userApplication starts online processing and, assuming that the fault is cleared, resets the application to the Online state.
Note: A forced online request will fail if fault processing has failed or if the PreserveState attribute was set. In these cases, it is likely that individual nodes will be in an undefined Wait state in which RMS cannot process an online request; the request is rejected to ensure consistency.
Forced online request
A forced online request can be sent to a target host that is not the host on which the application was running when the fault occurred. This would be an instance of a forced switchover. Again, the forced request will not be successful if fault processing failed on the previous host; a forced switchover does not automatically mean a forced offline request on the previous host.
If fault processing succeeded on the previous host, the forced online request has the same effect as a local forced online request sent to the target host. Online processing is initiated even if individual nodes are faulted.
6.6.5 SysNode faults
RMS handles a fault that occurs in a SysNode differently from faults in any other type of resource node. A SysNode fault occurs under the following conditions:
- SysNode detector loses contact with RMS
- SysNode detector loses contact with a clu
40. to its expected usage, this message is printed and the utility exits with exit code 6.
  Action: Follow the usage specified above.
- Usage: hvcm V L a s targethost c config_file L m h time 1 level r count L w time
  Usage is not correct.
  Action: Check the hvcm man page for correct usage.
- Usage: hvconfig 1 o config_file
  An attempt to use the hvconfig utility in a way that does not conform to the expected usage leads to this message, and the utility exits with exit code 6.
  Action: Follow the expected usage for the utility.
Console messages in alphabetical order
- Usage: hvdisp a c h i 1 n S resource_name u c z resource_name T resource_type u c u resource_name ENV ENVL o out_file
  An attempt to use the hvdisp utility in a way that does not conform to the expected usage leads to this message, and the utility exits with exit code 6.
  Action: Follow the expected usage for the utility.
- Usage: hvdump g f out_file t wait_time
  An attempt to use the hvdump utility in a way that does not conform to the expected usage leads to this message, and the utility exits with exit code 6.
  Action: Follow the expected usage for the utility.
- Usage: hveject s host
  An attempt to use the hveject utility in a way that does not conform to the expected usage leads to this message and the utility ex
41. [Figure 70: RMS tree with a controller object]
5.2.4.2 Configuration information or object attributes
View the configuration information for an individual object by left-clicking the object in the tree. The properties are displayed in tabular format in the right-hand panel of the RMS main window (Figure 71).
[Figure 71: Cluster Admin window showing the attributes of a system node]
42. Figure 25: List of nodes for failover procedure
Figure 26: Machines+Basics menu for additional nodes
Figure 27: AutoSwitchOver mode
Figure 28: Setting flags for AutoSwitchOver mode
Figure 29: Saving settings
Figure 30: Non-basic settings
Figure 31: Prompting for display specification
Figure 32: List of display options
Figure 33: Successful consistency check for APP1
Figure 34: Turnkey wizard DEMO
Figure 35: Global settings: main menu
Figure 36: Global settings: machines menu
Figure 37: Global settings: AlternateIps first menu
Figure 38: Global settings: AlternateIps second menu
Figure 39: Global settings: AlternateIps first menu with first interface
Figure 40: Global settings: AlternateIps first menu with both interfaces
Figure 41: Global settings: main menu with AlternateIps for first host
Figure 42: Global settings: main menu with AlternateIps for both hosts
Figure 43: Main configuration menu
Figure 44: Successful configuration activation
Figure 45: Quitting the Main configuration menu
Figure 46: Starting again with the Main configuration menu
Figure 47: Application type selection menu
Figure 48: Prompting for further specification
Figure 49: Machines+Basics menu
Figure 50: List of nodes for failover procedure
43. 5.2 Offline processing in a logical graph of a userApplication
6.5.3 Fault situations during offline processing
6.5.4 Node is already offline
6.5.5 Node does not have an Offline state
6.6 Fault processing
6.6.1 Faults in the online state or request processing
6.6.2 Offline fault
6.6.3 AutoRecover attribute
6.6.4 Fault clearing
6.6.5 SysNode faults
6.6.5.1 Operator intervention
6.7 Switch processing
6.7.1 Switch request
6.7.2 Extreme situations during switch processing
7 Troubleshooting
7.1 Overview
7.2 Debug and error messages
7.3 Log files
7.4 Using the log viewer
7.4.1 Search based on resource
7.4.2 Search based on time
7.4.3 Search based on keyword
7.4.4 Search based on severity levels
7.5 Using the hvdump command
7.6 Specifying the log level
7.7 Interpreting log files
7.8 System log
7.9 Wizard log files
44. Administration
- Overview
- Using Cluster Admin
- Starting Cluster Admin
- Logging in
- Main screen
- RMS main window
- RMS tree
- Configuration information or object attributes
- Command pop-ups
- Confirmation pop-ups
- Switchlogs and application logs
- RMS graphs
- RMS full graph
- Application graph
- Subapplication graph
- Composite subapplication graph
- Configuration information from a graph
- Command pop-ups
- RMS graph customization
- Node status after RMS is shut down
- RMS clusterwide table
- Command pop-ups
- Changing the RMS configuration
- RMS procedures
- Starting RMS
- Stopping RMS
- Starting an application
- Switching an application
- Taking an applica
45. Figure 51: Machines+Basics menu
Figure 52: Non-basic settings
Figure 53: Assigning a controller
Figure 54: List of applications to be chosen as controlled applications
Figure 55: Menu for setting controller flags
Figure 56: Changing controller timeout period
Figure 57: Saving flags for controller
Figure 58: Indication of flags set for controller
Figure 59: Menu with settings for GENERIC turnkey wizard
Figure 60: Main configuration menu
Figure 61: Main configuration menu
Figure 62: Activating the configuration for the second time
Figure 63: Return to Main configuration menu
Figure 64: Invoking the Cluster Admin GUI
Figure 65: Web-Based Admin View login screen
Figure 66: Top menu
Figure 67: Cluster menu
Figure 68: Main screen
Figure 69: RMS main window
Figure 70: RMS tree with a controller object
Figure 71: Configuration information or object attributes
Figure 72: Command pop-up
Figure 73: Command pop-up for an offline applicat
46. Action: Contact field support.
- ADC 63: Error <errno> while reading file <file>, reason: <reason>.
  While reading file <file>, an error <errno> occurred, explained by <reason>. File reading errors may occur during dynamic modification or during an hvjoin operation.
  Action: Contact field support.
- ADC 68: Error <errno> while opening file <file>, reason: <reason>.
  While opening file <file>, an error <errno> occurred, explained by <reason>. File open errors may occur during dynamic modification.
  Action: Verify that the file exists and reissue the dynamic modification request.
- ADC 70: Message sequence is out of sync. File transfer of file <filename> has failed.
  Critical internal error.
  Action: Contact field support.
8.2 ADM: Admin, command, and detector queues
- ADM 3 31: Dynamic modification failed: some resource(s) supposed to come offline failed.
  This message is printed during dynamic modification when new resources that are to be added to an offline parent object cannot be brought offline.
  Action: Make sure the new resources can be brought to the offline state and reissue the hvmod command.
- ADM 4 30: Dynamic modification failed: som
47. [Figure 76: Viewing the RMS switchlog file. The log viewer provides a time filter and a keyword filter (severity, non-zero exit code, keyword) above the switchlog contents.]
The Detach button separates the switchlog tab so that you can view it in its own window (Figure 77). The detached window can be rejoined to the main window with the Attach button.
[Figure 77: Viewing the RMS swi
48. Each dot represents an application or resource that has been successfully generated.
Configuration-Generate provides a way to generate and check a configuration without distributing it to the other nodes in the cluster. This may be useful for testing or debugging. Normally, you would use Configuration-Activate (described below) to generate and activate the configuration in one step.
Note: Configuration-Generate is always available, whether RMS is running or not.
- Configuration-Activate: Generates and activates a configuration.
  Selecting this item performs both the generation and activation phases in one step. The generation phase is described above. The activation phase prepares the cluster for RMS, ensuring that all the required data is put into place. The wizard distributes the configuration data to every node and installs all necessary files.
  Configuration-Activate is not available if RMS is already running on one or more nodes.
- Configuration-Push: Distributes a complete copy of the running configuration to a specific cluster node.
  When a configuration is activated, some nodes may not be available. This menu item allows you to update individual cluster nodes that are brought up later, when RMS is already running.
  Configuration-Push is available only after the configuration has been activated.
- Configuration-Copy: Produces a copy of an existing configuration.
49. Fault clearing
After successful fault processing, the resource nodes will be offline and the userApplication will be faulted. If offline processing fails as a result of the fault, or if the PreserveState attribute was set, at least part of the graph will be in a Wait state.
In all of the above states, the userApplication blocks normal requests, such as a switch request, since the base monitor assumes that at least some of the resources are not available. RMS can only resume normal operation after the system administrator has remedied the cause of the fault.
The following options are available for notifying the base monitor that the cause of the fault has been cleared (fault clearing):
1. After clearing the fault condition, the system administrator can use the following command to send a clear-fault request to the userApplication:
   hvutil -c <userApplication>
   This then starts further offline processing. If the fault has cleared, the entire tree will be offline. If required, the system administrator can reset the userApplication to the Online state with a switch request.
   Note: Invoking hvutil -c results in further online processing if the fault occurs below an orOp node. In such cases, the node and its parents up to the OR node are faulted. However, the fault has not been forwarded to the userApplication. The userApplication will thus still be online.
2. The system administrator can use the following command to make a forced
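The clear-fault request from option 1 above can be wrapped in a small helper for use in an operator script. This is only a sketch: the command name hvutil, its -c option, and the blocking behavior come from the text, while the function names and the state labels used as dictionary keys are hypothetical. The helper builds the argument list without executing it, so it can be inspected or logged before use.

```python
# States in which, per the text, a userApplication blocks normal requests.
# The string labels are assumptions made for this sketch.
BLOCKING_STATES = {"Faulted", "Wait"}


def accepts_normal_requests(app_state):
    """Return False while a fault still has to be cleared."""
    return app_state not in BLOCKING_STATES


def clear_fault_command(user_application):
    """Build the argument list for: hvutil -c <userApplication>."""
    if not user_application:
        raise ValueError("a userApplication name is required")
    return ["hvutil", "-c", user_application]
```

Returning a list rather than a single string keeps the application name as one argument if the list is later passed to a process launcher.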
50. Fault that occurred in the past has not yet been cleared.
The following resource states may also be displayed in the GUI status area:
- Wait: Temporarily in transition to a known state. An action has been initiated for the affected resource, and the system is waiting for the action to be completed before allocating one of the above states.
- Unknown: No information is available. Usually reported before object initialization is completed.
- Deact: Applies to userApplication objects only. Operator intervention has deactivated the application throughout the cluster, such as for maintenance purposes.
- Inconsistent: Applies to userApplication objects only. The object is Offline or Faulted, but one or more resource objects in its graph are Online or Faulted.
The interpretation of Offline and Faulted may depend on the resource type. For instance, a mount point resource can be either Online (mounted) or Offline (not mounted); in this case, the detector would never report the Faulted state. On the other hand, a detector for a physical disk can report either Online (normal operation) or Faulted (input or output error); it would never report Offline.
Detectors for common system functions are provided by the Wizard Tools. Additional application-specific detectors are included with the Wizard Tools and the Wizard Kit.
2.6.3 Scripts
RMS uses scripts to perform actions such as moving a r
51. <RELIANT_PATH>/build: Work area for configuration files. <RELIANT_PATH>/etc: Files that control the RMS environment. <RELIANT_PATH>/include: RMS include files (header files) used by detectors and configuration files. <RELIANT_PATH>/lib: RMS runtime libraries. <RELIANT_PATH>/us: RMS source files. The names of the files in this directory are reserved and should not be used to name any configuration files that the user may create. Table 2: RMS base directory structure. As summarized in Table 3, RMS log files are located in the directory specified in the RELIANT_LOG_PATH environment variable. Name: RELIANT_LOG_PATH. Contents: Contains files that can be used for RMS analyzing and debugging. Detectors and userApplication objects create log files here when they are started. Default: /var/opt/SMAWRrms/log. Table 3: Log directory structure. 3 Using the Wizard Tools interface This chapter describes how to configure high availability for customer applications using the RMS Wizards. • The section "Overview" on page 31 gives a brief overall description of the configuration process and the RMS Wizards. • The section "Site preparation" on page 34 describes the modifications to system files that are required for proper RMS operation. • The section General con
52. RMS base monitor shuts down if all of the following conditions are true: 1. The base monitor has encountered a different checksum from a remote monitor within the initial startup period defined by HV_CHECKSUM_INTERVAL. 2. There are no applications on this node that are Online, waiting, busy, or locked. 3. There are no online remote base monitors encountered by this base monitor. Otherwise, the base monitor keeps running, but all remote monitors whose checksums do not match the local configuration checksum are considered to be Offline, so no message exchange is possible with these monitors, and no automatic or manual switchover will be possible between the local monitor and these remote monitors. When different checksums are encountered, messages explaining the situation are placed in the switchlog. Action: Verify the problem by using hvdisp -a on the remote nodes to find out the actual configuration files. Compare the checksums of these configuration files. The hvdisp command does not require root privilege. If the base monitor does not shut down on its own but keeps running because one of the above conditions is not true, the system administrator may need to do the following: 1. Shut down certain base monitors. 2. Find out which configuration to run. 3. Distribute this file with hvdist. 4. Stop and restart RMS on the entire cluster so that all cluster nodes run the same configuration. • RMS hangs after
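The three shutdown conditions listed above combine with a logical AND. The Python sketch below models that decision for illustration; the function name, parameter names, and the state labels "Wait", "Busy", and "Locked" are assumptions chosen for this example and do not come from the RMS implementation.

```python
# Illustrative sketch only. Names and state labels are hypothetical;
# the real decision is made inside the RMS base monitor.

ACTIVE_APP_STATES = {"Online", "Wait", "Busy", "Locked"}  # assumed labels


def base_monitor_should_shut_down(mismatch_within_startup_interval,
                                  local_app_states,
                                  online_remote_monitors):
    """Return True only if all three documented conditions hold.

    1. A differing checksum was seen within HV_CHECKSUM_INTERVAL of startup.
    2. No local application is Online, waiting, busy, or locked.
    3. No online remote base monitor has been encountered.
    """
    no_active_apps = not any(s in ACTIVE_APP_STATES for s in local_app_states)
    no_online_remotes = online_remote_monitors == 0
    return (mismatch_within_startup_interval
            and no_active_apps
            and no_online_remotes)


if __name__ == "__main__":
    # Mismatch seen early, nothing running locally, no online peers.
    print(base_monitor_should_shut_down(True, ["Offline"], 0))
    # Same mismatch, but one application is Online: the monitor keeps running.
    print(base_monitor_should_shut_down(True, ["Online"], 0))
```

When the function returns False, the text above applies: the monitor keeps running and treats the mismatching remote monitors as Offline.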
53. RMS starts This occurs if the AutoStartUp or AutoSwitchOver attributes are set only priority requests The system administrator generates a switch request by X X affected applications determines the target host i For priority request the configured priorities of the hosts relative to the condition which would prevent the application from going online the userApp1ication communicates with its complementary node on the target host RMS functions as follows if such a fault condition exists Terminates e switch processing directed switch e Identifies the next host in priority as the new target host priority switch If no new target host is identified RMS terminates switch processing RMS forwards the request to the host on which the X userApplication node is currently online RMS forwards the request to the host on which X userApplication is to go online To establish whether its local graph contains a fault X Table 6 Switch processing activities 166 U42117 J Z100 4 76 Advanced RMS concepts Switch processing Activity Switch Switch over online The userApplication carries out local offline X processing stops and thus deconfigures the ongoing application The userApplication transmits the online request to X the corresponding node on the target host The userApplication on the target host carries out X X local online processing
54. SAVE EXIT 4) REMOVE EXIT 5) ApplicationName APP2 6) BeingControlled no 7) Machines Basics Choose the setting to process: 7 Settings of turnkey wizard GENERIC. Yet to do: process the basic settings using Machines Basics. Yet to do: choose a proper application name. Figure 48: Prompting for further specification > Select Machines Basics by entering the number 7. The consistency of APP2 is checked, and the result is positive. When the Machines Basics menu appears, it shows that APP2 is initially configured to run on fuji2RMS (see item 7, Machines 0, in Figure 49). Consistency check ... Machines Basics (app2): consistent 1) HELP 2) ... 3) SAVE EXIT 4) REMOVE EXIT 5) AdditionalMachine 6) AdditionalConsole 7) Machines 0: fuji2RMS 8) PreCheckScript 9) PreOnlineScript 10) PostOnlineScript 11) PreOfflineScript 12) OfflineDoneScript 13) FaultScript 14) AutoStartUp: no 15) AutoSwitchOver: No 16) PreserveState: no 17) PersistentFault: 0 18) ShutdownPriority 19) OnlinePriority 20) StandbyTransitions 21) LicenseToKill: no 22) AutoBreak: yes 23) HaltFlag: no 24) PartialCluster: 0 25) ScriptTimeout Choose the setting to process: 5 Figure 49: Machines Basics menu Creating a second application > Select AdditionalMachine by entering the number 5. A menu appears with the list o
55. The scripts of the leaf nodes execute first during online processing. The system then waits until the node changes to the Online state. On the other hand, nodes with children do not need an online script if they can be brought online in the OnlineScript of a child. Resource nodes that cannot go offline due to physical reasons, such as physical disks, are an exception to the rule that leaf nodes require online scripts. These nodes are identified in RMS configurations with the attribute LieOffline=1 (refer also to the section "Node does not have an Offline state" on page 159). In RMS, "the userApplication is online" means that all configured resources are online (ready to operate). In this case, the term online does not pertain to the state of the actual application. The actual application is either not controlled by RMS at all, or it is started in the online script and possibly in the post-online script of the userApplication. Even in the latter case, "the userApplication is online" only means that this script has been completed successfully. Whether and to what extent this fact permits statements to be made as to the state of the application is decided exclusively in the application and cannot be influenced by RMS. 6.4.3 PreCheckScript Before online processing begins, the PreCheckScript determines if online processing is needed or even possible. This procedure is
56. a component. By default, the cf tab is selected. The Cluster Admin GUI has some standard components that are common across RMS, CF, SIS, and the message window. They are as follows: • Pull-down menus: Pull-down menus that contain both functions generic to the Admin GUI and functions specific to the PRIMECLUSTER products. • Tree panel: The panel on the left is normally the tree panel. This panel displays product-specific configuration information. Click on a tree component to view further information in the main panel. • Main panel: The large panel on the right is the main work and information area. The content varies according to the product being administered and the functions selected from the menus or tree. 5.2.4 RMS main window To start the RMS portion of the GUI, click on the rms tab. An example of the RMS main window is shown in Figure 69. The main window area is split into two sub-areas. The RMS tree is displayed on the left-hand side panel. The right-hand side panel is used to display configuration information or properties of nodes, logs, or both, depending on the selections in the RMS tree. [Figure 69: screenshot of the RMS main window in Cluster Admin]
57. activation phase, it displays status information as described in the section "Activating a configuration" on page 49. You will be prompted to continue at the end of the process (see Figure 44). The new configuration was distributed successfully. About to put the new configuration in effect ... done. The activation has finished successfully. Hit CR to continue Figure 44: Successful configuration activation > Press the Enter or Return key to return to the Main configuration menu (Figure 45). fuji2: Main configuration menu, current configuration: mydemo No RMS active in the cluster 1) HELP 2) QUIT 3) Application Create 4) Application Edit 5) Application Remove 6) Application Clone 7) Configuration Generate 8) Configuration Activate 9) Configuration Copy 10) Configuration Remove 11) Configuration Freeze 12) Configuration Thaw 13) Configuration Edit Global Settings 14) Configuration Consistency Report 15) Configuration ScriptExecution 16) RMS CreateMachine 17) RMS RemoveMachine Choose an action: 2 Figure 45: Quitting the Main configuration menu > Select QUIT by entering the number 2. This ends the activation phase of the configuration process. At this point, RMS may be started to monitor the newly configured application. Configurat
58. alphabetical order • hvutil: RMS is not running on <targethost> Printed when hvutil -A targethost is called, indicating that RMS is not running on the named host. Action: None required. • hvutil: RMS is running on <targethost> Printed when hvutil -A targethost is called, indicating that RMS is running on the named host. Action: None required. • hvutil: The resource <resource> does not have a detector associated with it. The resource resource does not have a detector. Action: Issue hvutil -N on a resource which has a detector. • hvutil: The resource <resource> is not a valid resource. The resource resource is not a valid resource. Action: Issue hvutil -N on a resource which has a detector and is part of the resource graph. • hvutil: time period of detector must be an integer. If the detector time period specified as an argument with hvutil -t is not a number, hvutil is aborted and exits with exit code 6. Action: Make sure that the detector time period is an integer. • hvutil: Unable to open the notification file <path> due to reason <reason>. hvutil was unable to open the file path because of reason. Action: Contact field support. • Invalid delay. If the delay specified for sending a message using hvsend is a number less than zero, this message is
59. attribute is set. If the LOH timestamp entries of the userApplication on two hosts differ by less than this time interval, RMS does not perform AutoStartUp and does not allow priority switches. Instead, it sends a message to the console and waits for operator intervention. When adjusting this variable, the quality of the time synchronization in the cluster must be taken into account. The value must be larger than any possible random time difference between the cluster hosts. • HV_WAIT_CONFIG Possible values: 0 - MAXINT Default: 120 (seconds) Interval in seconds during which RMS waits to receive a configuration from an online host if RMS starts up as hvcm -C. If the configuration is not received within HV_WAIT_CONFIG seconds, the local monitor will attempt to run with the configuration file specified in RELIANT_BUILD_PATH. If such a file does not exist, the local monitor will continue to run with the minimal configuration; in this case, it is possible for it to join an already running RMS cluster via hvjoin. • RELIANT_LOG_LIFE Possible values: Any number of days Default: 7 (days) Specifies the number of days that RMS logging information is retained. Every time RMS starts, the system creates a directory that is named on the basis of when RMS was last started and which contains all constituent log files. All RMS log files are preserved i
60. bmlog description 172 browser 91 build directory 29 Cc CF commands cfconfig 349 cftool 349 cfset 349 cfsmntd 350 CIP commands cip cf 350 cipconfig 350 ciptool 350 Class attribute 335 clbackuprdb 352 clearing faulted resources 25 Faulted state 140 faults 140 hung nodes 25 SysNode Wait state 140 clgettree 353 CLI options 125 See RMS CLI clinitreset 352 clrestorerdb 353 clsetparam 353 clsetup 353 clstartrsc 353 clstoprsc 353 cluster 1 high availability 10 services 9 switching user applications 23 Cluster Admin 20 125 administrative privileges 93 application graph 110 application log files 103 clusterwide table 119 command pop ups 100 configuration 113 graph 112 GUI 91 logging in 93 main screen 95 object attributes 99 operator privileges 93 overview 10 primary management server 92 RMS graphs 108 RMS main window 97 RMS tree 97 root privileges 93 searching log text 107 secondary management server 2 starting 91 switchlog 103 switchlog panel 104 SysNode selection 100 userApplication selection 100 using 91 388 U42117 J Z100 4 76 Index cluster file system 9 Cluster Foundation LEFTCLUSTER 164 cluster node detector timeout for remote 345 ignore at startup 341 wait to report online 342 cluster volume management 9 ClusterExclusive attribute 326 clusterwide table 119 Cmdline resource wizard 33 command pop ups clusterwide table 122 RMS graph 114 RMS tree 100 commands
61. code Critical internal error. Action: Contact field support. • UAP 35: object inconsistency occurred. Any further switch request will be denied (except forced requests). Clear inconsistency before invoking further actions. This message appears when the state of the application is offline or standby and some of the resources are online and faulted. Action: Clear the inconsistency by the appropriate command (usually hvutil -c). • UAP 41: cannot open file filename. Last Online Host for userApplication cannot be stored into non-volatile device. File open error. Action: Check the reliant path. • UAP 42: found incorrect entry in status file: >entry< This message appears when the status_info file has an incorrect entry in it. This should occur only if the status_info file was edited manually. Action: Check the status_info file for manual incorrect entries. If this is not the case, contact field support. • UAP 43: <object> could not insert <host> into local priority list. Action: Critical error. Contact field support. • UAP 44: <object> could not remove <host> from local priority list. Action: Critical error. Contact field support. • UAP 45: <object> could not remove <host> from priority list. Action: Critical error. Contact field support. 8.19 US: us files • US 5: The cluster host hostname is no
62. command starts RMS with the configuration file specified by the -c option. If no -c is present, RMS uses the default startup file CONFIG.rms. The hvcm command starts the base monitor and the detectors for all monitored resources. In most cases, it is not necessary to specify options to the hvcm command; the default values are sufficient for most configurations. The default startup file CONFIG.rms is located in <RELIANT_PATH>/etc. If the default for the environment variable RELIANT_PATH has not been changed, RMS searches for CONFIG.rms in the default root directory /opt/SMAW/SMAWRrms/etc. Note: The system default run level in /etc/inittab must match the chosen RMS start run level; otherwise, the start sequence may be out of order. To verify or to change the RMS run level, use the hvrclev command. Refer to the chapter "Appendix: List of manual pages" on page 349 for more information. 5.3.2 Stopping RMS 1. Use the Tools pull-down menu or right-click on a system node and select the mode of shutdown in the subsequent option screen (Figure 101). [Figure 101: Cluster Admin screenshot]
63. components of the Wizard Kit. Note: For information on the availability of the RMS Wizard Kit, contact your local customer support service or refer to the RMS Wizards documentation package. See the section "Further reading" on page 55 for more information. 3.1.1.1 Turnkey wizards Turnkey wizards provide predefined structures of resources to monitor almost every basic operating system object. This relieves the user of the tedious task of linking system resources according to their dependencies. Many turnkey wizards are designed to configure a specific type of application. The configuration described in the chapter "Configuration example" on page 57 uses the DEMO and GENERIC turnkey wizards. Other examples are the R/3 wizard and the ORACLE wizard. By convention, turnkey wizards have names with all uppercase letters. 3.1.1.2 Resource wizards Resource wizards (sometimes called sub-application wizards) configure lower-level resources such as file systems or IP addresses. They are invoked by turnkey wizards and are not designed to interact directly with the user. Resource wizards have names that begin with one uppercase letter followed by one or more lowercase letters. The following are some of the more important resource wizards: • Cmdline: Configures any generic resource type by specifying StartScript (to bring the resource online), StopScript (to send
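The naming conventions above (turnkey wizards in all uppercase, resource wizards with one uppercase letter followed by lowercase letters) can be checked mechanically. The following Python sketch is a hypothetical helper written for this illustration; it is not part of the Wizard Tools, and the return labels are assumptions.

```python
import re

# Illustrative sketch only: a hypothetical helper, not part of the Wizard Tools.
# It classifies a wizard name purely by the naming conventions stated above.

RESOURCE_PATTERN = re.compile(r"[A-Z][a-z]+")


def wizard_kind(name):
    """Classify a wizard name as 'turnkey', 'resource', or 'unknown'."""
    if RESOURCE_PATTERN.fullmatch(name):
        # One uppercase letter followed by one or more lowercase letters.
        return "resource"
    if name.isupper():
        # All cased characters are uppercase, e.g. DEMO, GENERIC, ORACLE.
        return "turnkey"
    return "unknown"


if __name__ == "__main__":
    for candidate in ["DEMO", "GENERIC", "ORACLE", "Cmdline", "Controller"]:
        print(candidate, wizard_kind(candidate))
```

A name-based check like this only reflects the convention; it cannot confirm that a wizard with that name is actually installed.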
64. control of individual objects. For instance, RMS would process one script when a resource reports a transition from the Online state to the Offline state; however, RMS would process a different script when it must force the resource to the Offline state. Internally, RMS represents a user application and all of its resources as a userApplication object. Bringing a userApplication to the online state, along with all of its dependent resources, is called online processing. Taking a userApplication to the offline state, along with all of its dependent resources, is called offline processing. Machines that are members of a cluster are called nodes. Nodes that may run applications are represented by RMS SysNode objects. Like resource and application objects, each SysNode has an associated set of scripts and dependent resources. 2.2.2 Node and application failover During normal operation, one instance of RMS runs on each node in the cluster. Every instance communicates with the others to coordinate the actions configured for each userApplication. If a node crashes or loses contact with the rest of the cluster, then RMS can switch all userApplication objects from the failed node to a surviving node in the cluster. This operation is known as failover. Failover can also operate with individual applications. Normally, a userApplication object is allowed to be online on only one node at a time. Exceptions to this rule are shared objects like Oracle R
65. dfstab Contains entries for all of the shared remote resources in the high availability configuration. In other words, this file describes the file systems that can be mounted on a remote node. RMS entries appear as comments and will be ignored by all processes other than PRIMECLUSTER components. Therefore, to ensure that the NFS daemons start at boot time, there must be at least one non-comment, non-RMS entry in this file. The non-RMS entry might be a dummy entry configured for a local file system and shared only to the local node. This would mean that no real sharing to a remote node is done, but it would still cause the NFS daemons to be started. For more information, see the dfstab manual page. Example: The following contains both a non-RMS entry and an RMS entry: share -F nfs -o ro=localhost /var/opt/example # RMS# share -F nfs -o rw,root=fuji2RMS:fuji2:045nfs045dial:045msg:fuji2RMS /sapmnt/045 3.2.2.1 NFS Lock Failover (Solaris only) The NFS Lock Failover feature applies to local file systems. If you enable NFS Lock Failover and the file system subsequently fails, the NFS locks associated with the file system also fail over along with the file system. To take advantage of this feature, the following site preparation steps need to be taken: You must have a shared disk accessible to all nodes in the cluster. The internal implementation of NFS Lock Failover needs a dedicated directory. You need to specify a directory that does no
66. dkmigrate 352 dkmirror 352 dktab 352 documentation additional 2 wizards 55 E echo service 39 ENV attributes 323 description 323 ENV and ENVL objects 27 environment variables 27 displaying 142 HV_AUTOSTART_WAIT 342 HV_AUTOSTARTUP_IGNORE 341 HV_CHECKSUM_INTERVAL 342 HV_CONNECT_TIMEOUT 345 HV_LOG_ACTION 346 HV_LOG_ACTION_THRESHOLD 343 HV_LOG_WARN_THRESHOLD 343 HV_MAXPROC 346 HV_RCSTART 346 HV_REALTIMEPRIORITY 346 HV_SCRIPTS_DEBUG 347 HV_SYSLOG_USE 347 HV_WAIT_CONFIG 344 RELIANT_HOSTNAME 347 RELIANT_INITSCRIPT 348 RELIANT_LOG_LIFE 344 RELIANT_LOG_PATH 171 344 RELIANT_PATH 344 RELIANT_SHUT_MIN_WAIT 345 RELIANT_STARTUP_PATH 348 SCRIPTS_TIME_OUT 348 ENVL attributes 323 description 323 error messages 169 base monitor 171 console 295 fatal 281 switchlog 195 errors at initialization 157 during offline processing 158 in offline state 162 reactionto 159 etc directory 29 F failover 11 fatal error messages 281 fault clearing 163 fault script 160 Faulted state 21 clearing 140 FaultScript 23 faults clearing 140 failover 64 FaultScript attribute 327 script 23 file systems as application resources 16 31 filling up 191 Fsystem 33 resource type 19 site preparation 34 warning threshold 343 fisvwvbs 356 fisvwvenf 356 Follow mode controllers 13 Follow attribute 327 forced shutdown 132 forced switchover 164 forced online requests 163 fsck_rcfs 350 390 U42117 J Z100 4 76 Index Fsystem resou
67. do the following: • Define the set of resources used by the application, including disks, volume managers, file systems, processes to be monitored, and IP addresses. • Define the relationship between each resource and its dependent resources, e.g., which file system depends on which virtual or physical disk, which processes depend on which file systems, and so forth. • Define the relationship between the applications being controlled, for example, which applications must be up and running before others are allowed to start. • Provide scripts to bring each resource online and offline. • Provide a detector to determine the state of each resource. Configuring the above set of requirements by hand can be quite time consuming and prone to errors. This is why the RMS Wizard Tools were developed. The PRIMECLUSTER RMS wizards allow the creation of flexible and quality-tested RMS configurations while minimizing your involvement. A simple user interface prompts you for details regarding your applications and resources. Using these details, the wizards automatically select the proper scripts and detectors and combine them in a pre-defined structure to produce a complete RMS configuration. Specialists skilled in popular applications and in RMS worked together to create the RMS Wizards. The wizards are designed to easily configure RMS for certain
68. fail in certain combinations. In the banking scenario, for example, the teller application depends on a network, and the database application depends on a local file system. Suppose the file system on node1 fails and the database goes offline. If the database controller is operating in Follow mode, RMS will attempt to switch the teller and database to node2. However, if the network on node2 is offline or faulted, the teller can't be brought online there, so the teller application is prevented from running on either node1 or node2. This will not happen if the controlled database application operates in Scalable mode. If the network is online on node1 and the file system is online on node2, then the database can be switched independently, as shown in Figure 5. [Figure 5: Scalable mode, controlled (child) application switchover] Conversely, a network outage could cause RMS to switch the teller to node2 while leaving the database online on node1, as shown in Figure 6. [Figure 6: Scalable mode, controlling (parent) application switchover] As noted earlier, RMS allows only one instance of an application in a cluster. That is, an application can run on only one node at a time. However, controller objects do not have the same re
69. for a SysNode object selection and userApplication object selection. It also offers different options for a userApplication object in the online state than in the offline state (Figure 73). [Figure 73: Command pop-up for an offline application] 5.2.4.4 Confirmation pop-ups When you select a
70. for the sequencing of Online or Offline requests. Groups separated by colons are processed sequentially, from left to right for Online requests and from right to left for Offline requests. Each group can be a single application name or a list of application names separated by spaces or tabs. All applications in a single group are processed in parallel. For example, if the sequence is app1:app2a app2b:app3 then an Online request would first process app1, then app2a and app2b in parallel, and finally app3. • AutoRecover Possible values: 0, 1 Default: 0 Valid for resource objects. If set to 1, executes the online script for an object if the object becomes faulted while in an Online state. If the object is able to return to the Online state, the fault is recovered. This attribute must be 0 for Scalable controllers; RMS handles switchover of Scalable child applications automatically. • AutoStartUp Possible values: 0, 1 Default: 0 Valid for userApplication objects. Automatically brings an application Online on the SysNode with the highest priority when RMS is started. Set to either 0 for no or 1 for yes. • AutoSwitchOver Possible values: Valid string containing one or more of the following: No, HostFailure, ResourceFailure, ShutDown Default: No Valid for userApplication objects. Configures an application for automatic switchover i
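The sequencing rule described above (colon-separated groups, members separated by spaces or tabs, left to right for Online, right to left for Offline) can be sketched in a few lines of Python. The function names are hypothetical and exist only for this illustration; the sample sequence string app1:app2a app2b:app3 follows the colon-separated form the text describes.

```python
# Illustrative sketch only: function names are hypothetical, and RMS performs
# this ordering internally. Members of one group are processed in parallel.

def parse_groups(sequence):
    """Split a sequence string into groups of application names.

    Groups are separated by colons; names inside a group are separated
    by spaces or tabs.
    """
    return [group.split() for group in sequence.split(":") if group.strip()]


def processing_order(sequence, request):
    """Return the groups in the order they would be processed.

    Online requests run left to right, Offline requests right to left.
    """
    groups = parse_groups(sequence)
    if request == "Online":
        return groups
    if request == "Offline":
        return groups[::-1]
    raise ValueError("request must be 'Online' or 'Offline'")


if __name__ == "__main__":
    seq = "app1:app2a app2b:app3"
    print(processing_order(seq, "Online"))
    print(processing_order(seq, "Offline"))
```

Each inner list represents one group whose members are handled in parallel, so only the order of the outer list matters.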
71. [Figure 108: Shutting down an application] CLI The syntax for the CLI is as follows: hvutil -f userApplication Use the command hvutil -s userApplication to bring an offline userApplication to a Standby state. 5.3.6 Activating an application Activating an application takes it from the Deact state to the Offline state. It does not bring it Online. Also, activating a userApplication has nothing to do with activating an RMS configuration; the two operations are completely independent. Activate a deactivated application as follows: > Right-click on the application object and select the Activate option from the pop-up menu. CLI The syntax for the CLI is as follows: hvutil -a userApplication You will not need to activate an application unles
72. has been invoked and the RMS base monitor cannot be started on the individual hosts comprising the cluster with the command command. A script cannot be started because RMS is unable to create the script process with the command command. RMS shuts down on the node where this message appears and returns an error number errno, which is the error number returned by the operating system. Action: Consult the system manual pages or the appendix of this manual for the explanation of error number errno and see if the cause is evident. If not, contact field support. • WRP 5: No handler for this signal event <signal>. There is no signal handler associated with the signal signal. Action: Contact field support. • WRP 6: Cannot find process (pid processid) in the process wrappers. Action: Critical error. Contact field support. • WRP 7: getservbyname failed for service name servicename. Action: Critical error. Contact field support. • WRP 8: gethostbyname failed for remote host host. Action: Critical error. Contact field support. • WRP 9: Socket open failed. This message occurs if RMS is unable to create a datagram endpoint for communication. Action: Contact your system administrator. • WRP 10: connect to server failed. Action: Critical error. Contact field support. • WRP 11: Message send failed, queue id <queueid>
73. hvrcp localfile node remotefile This message is the result of either of these conditions: the number of arguments specified is not equal to 2, or the second argument is not specified in the form node remotefile. hvrcp then exits with exit code 6. Action: Follow the intended usage of hvrcp as specified above. • Usage hvreset t timeout userApplication An attempt to use the hvreset utility in a way that does not conform to the expected usage leads to this message, and the utility exits with exit code 2. Action: Follow the expected usage for the utility. • Usage hvsend m message s system 1 w waittime dest_object f in_file dest_object 1 This is a result of using an unknown option with the hvsend command. Action: Follow the intended usage of the utility. • Usage hvshut f L a 1 s SysNode If the usage of hvshut does not conform to the expected usage as specified in the above message, the hvshut utility exits with exit code 6. Action: Follow the usage specified above. • Usage hvswitch f userApplication SysNode p userApplication If an unknown option is used with the hvswitch utility, or if more than 2 arguments are specified for hvswitch, it exits with exit code 6. Action: Follow the intended usage of the utility. • Usage hvutil a d f c s u
74. hvswitch f 163 hvutil c 163 Comment attribute 335 composite subapplication graph 112 configuration information graphs 113 configurations displaying 24 displaying information 96 97 general procedure 40 graph 108 individual node details 99 configuring applications 31 33 disk groups 33 file systems 33 IP addresses 33 resources 33 console error messages 295 controlled applications 12 ControlledShutdown attribute 336 controller attributes 323 dependencies 98 description 323 graph 112 Controller resource wizard 33 controllers 12 Follow mode 13 Scalable mode 14 creating application 58 second application 79 D Deact state 22 debug level wizards 190 debug messages 169 base monitor 171 log directory 171 severity level 180 wizards 187 189 debug reporting wizards 190 DEBUG statements wizards 190 defining timeout 345 DEMO turnkey wizard 33 41 61 62 dependant resources 11 dependencies 98 det_disklog file 173 detectors 10 11 fault situations 156 illegal 151 RMS Wizard Kit 19 RMS Wizard Tools 19 starting 129 DetectorStartScript attribute 336 directed switch requests 165 directories bin 29 build 29 etc 29 include 29 lib 29 us 29 directory hierarchy root directory 344 specifying root directory 344 disk classes as application resources 16 displaying application states 145 current RMS configuration 24 environment variables 142 U42117 J Z100 4 76 389 Index dkconfig 352
75. ication
6.6.3 AutoRecover attribute
A node of the type mount is one example of a node that can enter a Faulted state for reasons that are easily and automatically remedied. A fault that occurs in the node itself, and not as a result of an input/output fault on an underlying disk, is most likely caused by a umount command that was executed erroneously. In this case, switching over the entire application would probably not be the best remedy, so fault processing would not be the best solution.
For such cases, programmers can configure the AutoRecover attribute in RMS. If a fault then occurs while the userApplication is online, the online script is invoked before the fault script. If the node enters the Online state again within a specific period after the online script has been executed, fault processing does not take place.
RMS only evaluates the AutoRecover attribute when the node itself is the cause of the fault, that is, when the fault is not caused by a child. Accordingly, RMS only evaluates AutoRecover for nodes with a detector. The AutoRecover attribute is not relevant if a fault occurs during request processing or in the Offline state.
The AutoRecover attribute is not set by default for any node type. The specialist who configures RMS must decide whether to use the attribute.
6.6.4
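The AutoRecover rules in section 6.6.3 can be summarized as a small decision function. This is a minimal sketch of the behavior as described there, not RMS code: the function `handle_fault`, the dictionary keys, and the returned script names are all hypothetical labels chosen for illustration.

```python
def handle_fault(node, back_online_in_time):
    """Return the scripts that would run for a fault, per the rules in 6.6.3.

    node is a dict with hypothetical keys:
      AutoRecover (bool), has_detector (bool), fault_from_child (bool),
      in_request (bool), state (str, the state in which the fault occurred).
    """
    # AutoRecover is only evaluated when the node itself caused the fault,
    # the node has a detector, and the fault did not occur during request
    # processing or in the Offline state.
    evaluated = (
        node["AutoRecover"]
        and node["has_detector"]
        and not node["fault_from_child"]
        and not node["in_request"]
        and node["state"] != "Offline"
    )
    if not evaluated:
        return ["fault script"]
    # The online script is invoked before the fault script.
    if back_online_in_time:
        return ["online script"]  # node recovered, no fault processing
    return ["online script", "fault script"]
```

With an accidentally unmounted file system that the online script can remount in time, only the online script runs and the application is not switched over.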
76. is correctly defined on all cluster nodes and that it is always kept up to date. When a node is brought back into the cluster, remove it from this environment variable. If this does not occur, data loss could occur because RMS will ignore this node during the startup procedure and will not check whether the application is already running on the nodes specified in this list. It is the system administrator's responsibility to keep this list up to date if it is used.
• HV_AUTOSTART_WAIT
Possible values: 0 - MAXINT
Default: 60 (seconds)
Defines the period (in seconds) that RMS waits for cluster nodes to report Online when RMS is started. If this period expires and not all cluster nodes are online, a switchlog message indicates the cluster nodes that have not reported Online and why the user application(s) cannot be started automatically.
This attribute generates a warning message only. AutoStartUp will proceed even if the specified period has expired.
• HV_CHECKSUM_INTERVAL
Possible values: 0 - MAXINT
Default: 120 (seconds)
Interval in seconds for which the RMS base monitor waits for each Online node to verify that its checksum is the same as the local checksum. If checksums are confirmed within this interval, then RMS on the local node continues its operations as usual. However, if a checksum from a remote node is not confirmed, or if it is confirmed to be different, then the local monitor shuts down if it has been started less than HV_
77. is normally done by typing the number of the item followed by the Enter or Return key.
Within the menu, a prompting line indicates the kind of input that is required. A >> prompt indicates that a string of text should be entered.
• Responding to messages: Within the menus, several kinds of messages are displayed. One type of message informs the user about the activities that the wizard has performed, for example, a consistency check that ended in a positive result. Other messages may prompt the user to continue the configuration procedure with a certain activity, for example, choosing an application name.
• HELP: This item provides user assistance and is available at the top of every wizard menu.
• QUIT: This quits the wizard menu system.
• RETURN: This moves one level upward in the menu system, that is, from a subordinate menu to the menu it was called from.
• SAVE+EXIT and NOSAVE+EXIT: These save or discard your input and then exit. SAVE+EXIT may be disabled if the configuration is inconsistent at that point.
3.4.2 Main configuration menu
The Main configuration menu appears immediately after a configuration has been called up. This top-level menu shows the state of the RMS cluster by indicating one of the following:
• RMS is inactive
• The list of nodes where RMS is up and running
The
78. is running actively on the host systems of a cluster. In this case, you might call up the configuration because it is to be modified using the wizards. On the other hand, you might want to use the wizards to set up a new configuration. The commands for starting the wizards are as follows:
• hvw
Runs RMS Wizard Tools using the last activated configuration stored in the <RELIANT_PATH>/etc/CONFIG.rms startup file. If this file does not exist or activation is being done for the first time, RMS creates the default configuration, config.
• hvw -n configname
Edits an existing configuration or creates a new configuration using the specified name. The configuration will be stored in the <RELIANT_PATH>/build/<configname>.us startup file.
The sample configuration used for demonstration purposes in this chapter shows how to set up a new configuration called mydemo using the DEMO turnkey wizard. This example would be called up as follows:
hvw -n mydemo
The hvw command is documented in the online manual pages. Refer to the chapter "Appendix - List of manual pages" on page 349 for additional information.
3.4.1 Using the wizard menus
The hvw command produces character-driven menus that guide you in a way designed to be self-explanatory. The following are some of the most frequently used menu operations and items:
• Selecting items: This
79. longer reachable. Please check the status of the host and the ethernet connection.
This message is a result of one cluster host detecting that the other host <hostname>, which is part of the cluster, is no longer reachable. In other words, this cluster host sees the other host <hostname> as faulted. This could be because the other host <hostname> has gone down or because there is a problem with the cluster interconnect.
Action: Check if the host <hostname> is indeed dead. If not, check if there is a problem with the ethernet connection.
• (US, 6) RMS has died unexpectedly on the cluster host <hostname>
When the detector on the local host detects that the host <hostname> has transitioned from Online to Offline unexpectedly, it prints this message to the switchlog and then attempts to kill the host <hostname>.
Action: Check the syslog on the host <hostname> to find out why it has gone down.
• (US, 31) FAULT REASON: Resource <resource> transitioned to a Faulted state due to a detector report
This message is printed when the detector unexpectedly reports the Faulted state.
Action: Check to see if there is any problem with the resource.
8.20 WLT: Wait list
• (WLT, 1) REASON: Resource <resource>'s script <scriptexecd> has exceeded the ScriptTimeout of <timeout> seconds
The detector script for the resource has exceeded the ScriptTimeout limit.
Action: Make sure
80. messages go to both the switchlog file and the system log. HV_SYSLOG_USE is an environment variable that you can modify so that messages will or will not show in the system log. If you do not want the messages to go into the system log, set HV_SYSLOG_USE=0 in the hvenv.local file. Before changes can take effect, you must stop and restart RMS.
The default setting in hvenv is HV_SYSLOG_USE=1. This setting sends all RMS ERROR, FATAL ERROR, WARNING, and NOTICE messages to the system log and switchlog. For Log3 RMS messages, the component number is 1080023.
hvlogcontrol
The hvlogcontrol utility prevents log files from becoming too large. Since large numbers of log files can take up disk space, hvlogcontrol limits the amount of log files to a specified amount set in one of the following environment variables selected by the system administrator:
• HV_LOG_ACTION_THRESHOLD
• HV_LOG_WARN_THRESHOLD
• HV_SYSLOG_USE
hvlogcontrol is called automatically from the crontab file, so there is no manual page.
7.9 Wizard log files
The RMS Wizards log messages to files in the same log directory as is defined for RMS, according to the value set in the environment variable RELIANT_LOG_PATH.
RMS Wizards logging can be broken down into two categories as follows:
• Messages from resource detectors
• All other messages
Detector logging will be explained in more detail in secti
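The effect of HV_SYSLOG_USE described at the start of this passage can be sketched as a routing rule. This is an illustrative model only, not RMS code: the function name `log_destinations` is hypothetical, and the handling of severities other than the four named ones (sent to switchlog only) is an assumption.

```python
# Severities the text says are sent to the system log when HV_SYSLOG_USE=1.
SYSLOG_SEVERITIES = {"ERROR", "FATAL ERROR", "WARNING", "NOTICE"}

def log_destinations(severity, hv_syslog_use=1):
    """Return where a message of the given severity would be written.

    hv_syslog_use models the HV_SYSLOG_USE setting (default 1, as in hvenv).
    Assumption: every message reaches the switchlog regardless of the setting.
    """
    destinations = ["switchlog"]
    if hv_syslog_use == 1 and severity in SYSLOG_SEVERITIES:
        destinations.append("syslog")
    return destinations
```

Setting the variable to 0, as the text describes for hvenv.local, leaves only the switchlog as a destination in this model.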
81. object> does not match the value of the -k flag of its associated detector
Values for the rKind attribute and the -k flag of the detector startup line do not match.
Action: Fix RMS configuration.
• (BM, 107) Illegal different values for rKind attribute in object <object>
Different values for the rKind attribute are encountered within the same object.
Action: Fix RMS configuration.
• (BM, 108) 101 Dynamic modification failed: Scalable controller <object> cannot have its attribute <SplitRequest> set to 1
Setting the controller attributes Scalable and SplitRequest is mutually exclusive.
Action: Fix RMS configuration.
• (BM, 109) 102 Dynamic modification failed: Application <application> has its attribute PartialCluster set to 1 or is controlled, directly or indirectly, via a Follow controller that belongs to another application that has its attribute PartialCluster set to 1; this application <application> cannot have a cluster exclusive resource <resource>
An exclusive resource cannot belong to an application with the attribute PartialCluster set to 1, and cannot be controlled, directly or indirectly, by a Follow controller from an application with the attribute PartialCluster set to 1.
Action: Fix RMS configuration.
• (BM, 110) 103 Dynamic modification failed: Application <application> is controlled by a
82. object that is being added to userApplication is different from the values of the HostName attributes of other first-level children of appname.
• (ADM, 40) 25 Dynamic modification failed: a new child <childobject> of existing application <appname> does not have its HostName set to a name of any sysnode
When a new child object <childobject> is added to an application <appname> during dynamic modification and the HostName attribute is missing for this object, this message is printed and dynamic modification is aborted.
Action: The first-level object under appname must have a HostName attribute.
• (ADM, 41) 8 Dynamic modification failed: existing child <childobject> is not online, but needs to be linked with <parentobject> which is supposed to be brought online
If both the parent <parentobject> and the child <childobject> have detectors associated with them, and the child is not online but needs to be linked to a parent that is supposed to be online, this message is printed and dynamic modification is aborted.
Action: Make sure that the parent and the child are in a similar state.
• (ADM, 42) 9 Dynamic modification failed: existing child <childobject> is online, but needs to be linked with <parentobject> which is supposed to be brought offline
83. on the selected node
• Forced shutdown: Performs a forced shutdown of RMS.
Caution: Using a forced shutdown or leaving the applications running and stopping RMS can lead to data inconsistencies or corruption.
Click the Ok button to initiate the shutdown with your selections.
[Figure 103: Stopping RMS on one node from the list]
Figure 104 shows the command pop-up option to stop RMS on an individual node when you right-click on a system node and select Shutdown RMS.
84. popular applications such as Oracle or SAP R/3, and they are flexible enough to create custom RMS configurations that can control any other type of application.
2.4 How RMS Wizards provide easy configuration
PRIMECLUSTER provides the RMS Wizards to allow the creation of flexible and quality-tested RMS configurations. The RMS Wizards present a simple user interface that prompts you for details regarding the applications.
The RMS Wizards are designed to easily configure RMS for certain popular applications such as Oracle or SAP R/3, and they are flexible enough to create full RMS configurations that can control any other type of application. Specialists skilled in popular applications and in RMS worked together to create the RMS Wizards.
The RMS Wizards are broken up into the following separate products:
• RMS Wizard Tools
• RMS Wizard Kit
Figure 7 depicts the relationship between RMS, RMS Wizard Tools, and the RMS Wizard Kit.
[Figure 7 diagram: RMS Wizard Kit (application-specific wizards, scripts, and detectors) on top of RMS Wizard Tools (resources)]
85. reason, RMS exits with exit code 50.
Action: Restart RMS.
• (ADM, 57) hvdisp - open failed - filename
If RMS is unable to open the file /opt/SMAW/SMAWRrms/locks/.rms.<pid> for writing when hvdisp has been invoked, this message is printed.
Action: Verify that the directory /opt/SMAW/SMAWRrms/locks exists and allows files to be created (correct permissions, free space in the file system, free inodes). If one of these problems exists, fix it via the appropriate administrator operation. If none of these problems apply but the RMS failure still occurs, contact RMS support.
• (ADM, 58) hvdisp - open failed - filename : errormsg
When hvdisp is unable to open the file /opt/SMAW/SMAWRrms/locks/.rms.<pid> for writing, it prints the reason errormsg.
Action: Verify that the directory /opt/SMAW/SMAWRrms/locks exists and allows files to be created (correct permissions, free space in the file system, free inodes). If one of these problems exists, fix it via the appropriate administrator operation. If none of these problems apply but the RMS failure still occurs, contact RMS support.
• (ADM, 59) appname: modification is in progress, switch request skipped
This message is printed to the switchlog because commands like hvswitch, hvutil, and hvshut cannot run in parallel with a non-local hvmod.
A
86. received by the base monitor at runtime. Limited use to administrators, since turning on log level flags consumes a great deal of disk space. By default, RMS places no messages in bmlog.

Module | File name | Contents
Everything (base monitor, generic detector, node detector) | switchlog | Operational events, such as resource switches or bugs. Normally, switchlog is the only log file users need to examine.
generic detector | <program>log | All messages and job assignments received by the detector. Also contains resource state change information and all error messages. program is the name of the detector in the <RELIANT_PATH> directory.
node detector (hvdet_node) | hvdet_nodelog | Messages from the built-in node detector hvdet_node.
Table 7: Log files

7.4 Using the log viewer
Invoke the log viewer for the RMS switchlog file as follows:
1. Right-click on a SysNode in the RMS tree.
2. Select View Switchlog.
Figure 114 shows how to invoke the log viewer.
87. required since some applications may be unable to start during online processing, thus causing the application to become Faulted.
The PreCheckScript will be forked before the original online processing begins. If the script is successful and returns with an exit code of 0, online processing proceeds as usual. If the script fails and returns with an exit code other than 0, online processing is discarded and a warning is written into the switchlog.
Resulting state: When the PreCheckScript is running, the userApplication node transits into the Wait state. If the PreCheckScript fails, the userApplication node transits back into its previous state, usually Offline or Faulted.
AutoSwitchOver: If the PreCheckScript fails and AutoSwitchOver is true, then RMS automatically forwards the online request to the next priority host, except in cases of directed switch requests.
6.4.4 Fault situations during online processing
If an error situation occurs during online processing, the affected node commences fault processing and notifies its parent of the error (see also the section "Fault processing" on page 159). The following can cause faults during online processing:
• Detector signals the Faulted state
• Detector signals the Offline state for a node that was reported as Online
• Script fails with an exit status other than 0
• Script fails with a timeout
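The PreCheckScript handling described in this passage can be sketched as follows. This is a minimal model of the documented behavior under stated assumptions, not RMS code: the function `precheck_outcome`, its parameters, and the returned strings are hypothetical.

```python
def precheck_outcome(exit_code, previous_state, auto_switch_over,
                     directed_request, next_priority_host):
    """Model what happens after the PreCheckScript returns.

    Returns a tuple (action, resulting_state, forwarded_to).
    All names are illustrative; only the rules come from the text.
    """
    if exit_code == 0:
        # Success: online processing proceeds as usual.
        return ("proceed with online processing", None, None)
    # Failure: the request is discarded, a warning goes to the switchlog,
    # and the application returns to its previous state.
    forwarded_to = None
    if auto_switch_over and not directed_request:
        # AutoSwitchOver forwards the request to the next priority host,
        # except for directed switch requests.
        forwarded_to = next_priority_host
    return ("discard online processing and warn in switchlog",
            previous_state, forwarded_to)
```

For a failed pre-check on an application that was Offline, with AutoSwitchOver enabled and an undirected request, the model reports the Offline state locally and names the next priority host as the new target.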
88. section "Object types" on page 26 introduces the RMS object types.
• The section "Attributes" on page 26 defines the RMS attributes.
• The section "Environment variables" on page 27 lists the RMS environment variables.
• The section "Directory structure" on page 29 lists and describes the RMS directory structure.
2.1 PRIMECLUSTER overview
The PRIMECLUSTER family of products is an integrated set of cluster services, including high availability, scalability, parallel application support, cluster file system, cluster volume management, and administration. Figure 1 illustrates the relationship of PRIMECLUSTER services.
[Figure 1: Overview of PRIMECLUSTER]
The sections that follow focus on the role of the following PRIMECLUSTER products as they relate to high availability operation:
• RMS: This high availability manager is a software monitor that provides high availability (HA) for customer applications in a cluster of nodes. Its task is to monitor systems and application resources, to identify any failures, and to provide application availability virtually without interruption in the event of any such failures.
• RMS Wizard Tools and RMS Wizard Kit: These configuration products are used to create RMS configuratio
89. specific application type wizard to provide all the predefined elements, for example, scripts and detectors, that go with that application type.
The chapter "Configuration example" on page 57 shows how to use some of the secondary menus. A more detailed description of these menus is given in the RMS Wizards documentation package.
3.4.4 Basic and non-basic settings
Basic and non-basic settings are designed to guide you safely through the configuration process, ensuring that all mandatory settings are configured. Among the basic settings are the application name and the names of the nodes where it can run. For example, at the application type selection menu shown in the previous section, selecting 5) DEMO produces the menu in Figure 12.
  Consistency check ...
  Yet to do: process the basic settings using Machines+Basics
  Yet to do: choose a proper application name
  Settings of turnkey wizard "DEMO"
  1) HELP
  2) NO-SAVE+EXIT
  3) SAVE+EXIT
  4) REMOVE+EXIT
  5) ApplicationName=APP3
  6) BeingControlled=no
  7) Machines+Basics
  Choose the setting to process: 7
Figure 12: Menu leading to basic settings
If you select 7) Machines+Basics, you can configure the basic settings using the menu in Figure 13. Items enclosed in parentheses are optional.
  Consistency check ...
  Machines+Basics (app1:consistent)
  1) HEL
90. the non-basic settings menu.
4.11 Setting up a controlling application
The basic settings have been specified. However, we still need to set up APP2 to control APP1. This will involve the following two steps, available in the non-basic settings:
• Create a controller object for APP2.
• Specify APP1 as the application to be controlled.
The previous step has taken you to the non-basic settings menu (Figure 52).
  Consistency check ...
  Yet to do: process at least one of the non-basic settings
  Settings of turnkey wizard "GENERIC"
  1) HELP                          10) RemoteFileSystems
  2) -                             11) IpAddresses
  3) SAVE+EXIT                     12) RawDisks
  4) -                             13) RC VolumeManagement
  5) ApplicationName=APP2          14) VERITAS VolumeManagement
  6) Machines+Basics (app2)        15) EMC RdfManagement
  7) CommandLines                  16) FibreCat MirrorView
  8) Controllers                   17) Gds Global Disk Services
  9) LocalFileSystems              18) Gls Global Link Services
  Choose the setting to process: 8
Figure 52: Non-basic settings
> Select Controllers by entering the number 8.
This creates a controller object for APP2 and presents a menu that lets you specify the controller settings (Figure 55).
  Consistency check ...
  Yet to do: assign at least one application to control
  Yet to do: configure at least one controlled application without the M flag
  Settings of application t
91. the state of the object is ignored by the parent when calculating the parent's state. Any parent should have at least one child for which MonitorOnly is not set.
• OfflineScript
Possible values: Valid script (character)
Default: empty
Valid for all object types except SysNode objects. Specifies the script to be run to bring the associated resource to the Offline state.
• OnlinePriority
Possible values: 0, 1
Default: 0
Valid for userApplication objects. Allows RMS to start the application on the node where it was last Online when the entire cluster was brought down and then restarted. In case of AutoStartUp or a priority switch, this last Online node has the highest priority, regardless of its position in the priority list. If set to 1, the application comes Online on the node where it was last Online. If not set (0), the application comes Online on the node with the highest priority in the attribute PriorityList.
RMS keeps track of where the application was last Online by means of timestamps. The node which has the latest timestamp for an application is the node on which the application will go Online. Different cluster nodes should be in time synchronization with each other, but this is not always the case. Since RMS does not provide a mechanism for ensuring time synchronization between the nodes in the cluster, this responsibi
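The OnlinePriority rule above can be sketched as a node-selection function. This is an illustration of the documented rule only: the function `startup_node` and its parameters are hypothetical, the first PriorityList entry is assumed to have the highest priority, all listed nodes are assumed to be available, and the fallback to the priority list when no timestamps exist is an assumption.

```python
def startup_node(online_priority, priority_list, last_online_timestamps):
    """Pick the node on which the application would come Online.

    online_priority: 0 or 1, as in the OnlinePriority attribute.
    priority_list: node names, highest priority first (assumed ordering).
    last_online_timestamps: dict of node name -> time the application was
        last Online there (any comparable value, e.g. epoch seconds).
    """
    if online_priority == 1 and last_online_timestamps:
        # The node with the latest timestamp wins, regardless of its
        # position in the priority list.
        return max(last_online_timestamps, key=last_online_timestamps.get)
    # OnlinePriority not set: use the highest-priority node.
    return priority_list[0]
```

Because the choice depends on comparing timestamps recorded on different nodes, the sketch also shows why unsynchronized clocks could lead to the wrong node being selected.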
92. transition request originates from the controller itself. The script is also executed once each time a SysNode in the child application object's PriorityList changes its state to Offline or Faulted.
• WarningScript
Possible values: Valid script (character)
Default: empty
Valid for resource objects with detector. Specifies the script to be run after the posted state of the associated resource changes to Warning.
13.2 Attributes managed by configuration wizards
Attributes in this section are managed internally by the configuration wizards.
• Affiliation
Possible values: Any string
Default: empty
Valid for resource objects. Used for display purposes in the user interface; no functional meaning within RMS.
• AutoRecoverCleanup
Possible values: 0, 1
Default: 1
Valid for controller objects. If set to 1 and AutoRecover is 1, then a faulted child application is requested to go Offline before recovering. If set to 0 and AutoRecover is 1, then a faulted child application recovers without going Offline.
• Class
Possible values: Any string
Default: Default type as defined in "Appendix - Object types"
Valid for all objects except SysNode. Describes the class of the resource object. Used by other programs for vario
93. variables" on page 28 for more details.
Global variable settings (ENV) are included in the configuration's checksum that is common to the cluster. The checksum is verified on each node during startup of the base monitor.
While RMS is running, you can display the environment variables with the hvdisp command, which does not require root privilege. Use hvdisp ENV for the global list and hvdisp ENVL for the local list.
The global environment variables (ENV) are as follows:
• HV_AUTOSTARTUP_IGNORE
• HV_AUTOSTART_WAIT
• HV_CHECKSUM_INTERVAL
• HV_LOG_ACTION_THRESHOLD
• HV_LOG_WARN_THRESHOLD
• HV_WAIT_CONFIG
• RELIANT_LOG_LIFE
• RELIANT_LOG_PATH
• RELIANT_PATH
• RELIANT_SHUT_MIN_WAIT
The local environment variables (ENVL) are as follows:
• HV_CONNECT_TIMEOUT
• HV_LOG_ACTION
• HV_MAX_HVDISP_FILE_SIZE
• HV_MAXPROC
• HV_RCSTART
• HV_SYSLOG_USE
• RELIANT_HOSTNAME
• RELIANT_INITSCRIPT
• RELIANT_STARTUP_PATH
• SCRIPTS_TIME_OUT
Refer to the chapter "Appendix - Environment variables" on page 341 for a description of all global and local environment variables.
2.9.1 Setting environment variables
When RMS starts, it reads the values of environment variables from hvenv and hvenv.local and initializes the ENV and ENVL objects respectively. To set the values of environment variables before starting RMS, the variables hav
94. with the state of the controller, or because the list contains duplicate elements
This message appears when the user tries to change the Resource attribute of the controller object <controller> from <oldresource> to <newresource>, because one or more of the applications listed in <newresource> is not an existing application, or its state is incompatible with the state of the controller, or because the list contains duplicate elements.
Action: Make sure that the applications listed in the resource <newresource> are valid and are not listed more than once.
• (ADM, 100) 74 Dynamic modification failed: because a controller <controller> has AutoRecover set to 1, its controlled application <appname> cannot have PreserveState set to 0 or AutoSwitchOver set to 1
If an application is to be controlled by a controller that has its AutoRecover set to 1, then the application's attributes PreserveState and AutoSwitchOver need to be 1 and 0 respectively.
Action: Check the PreserveState and AutoSwitchOver attributes of the application.
• (ADM, 106) The total number of SysNodes specified in the configuration for this cluster is hosts. This exceeds the maximum allowable number of SysNodes in a cluster which is maxhosts
The total number of SysNode objects in the cluster has exceeded the maximum allowable limit.
Action: Make sure that the total number of SysNode objects in the cluster does not
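The attribute constraint behind message ADM 100 can be expressed as a simple pre-check. This is an illustrative sketch of the stated rule only, not the check RMS performs internally; the function name `violates_adm_100` and its parameters are hypothetical.

```python
def violates_adm_100(controller_auto_recover, preserve_state, auto_switch_over):
    """Return True if the combination would trigger the ADM 100 failure.

    Rule from the text: if the controller has AutoRecover set to 1, the
    controlled application needs PreserveState=1 and AutoSwitchOver=0.
    """
    if controller_auto_recover != 1:
        return False  # the constraint only applies when AutoRecover is 1
    return preserve_state == 0 or auto_switch_over == 1
```

Running such a check on the intended attribute values before a dynamic modification would point to the same attributes the Action line asks the administrator to review.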
95. 0 14
Further notes about controllers 15
How the Wizard Tools provide easy configuration 16
How RMS Wizards provide easy configuration 17
RMS Wizard Tools 19
RMS Wizard Kit 19
Cluster Admin 20
RMS components 20
Base monitor
Detectors and states 21
Scripts
RMS CLI 23
Object types 26
Attributes 26
Environment variables 27
Setting environment variables 28
Directory structure 29
Using the Wizard Tools interface 31
Overview 31
RMS Wizard types 32
Turnkey wizards 32
Resource wizards 33
Site preparation 34
Network 34
File systems (Solaris only) 36
NFS Lock Failover (Solaris only) 37
File systems (Linux only) 38
Log files
Other system services and databases 39
General configuration procedure 40
Creating and editi
96. 0 in the hvenv.local file. The default is 1.
3.2.5 Other system services and databases
RMS requires the following system services or databases to be configured according to the instructions in the PRIMECLUSTER Installation Guide (Solaris/Linux):
• /etc/nsswitch.conf: system service lookup order database
• rcp/rsh service
• echo service (Linux only)
3.3 General configuration procedure
RMS configuration always involves these four steps:
> Stop RMS.
Refer to the section "Stopping RMS" on page 130. You can use the Cluster Admin GUI or the command line interface from any node in the cluster.
> Create or edit the configuration.
The next section provides general information, and the chapter "Configuration example" on page 57 walks through an example.
> Activate the configuration.
Activation includes generation and distribution. See the section "Activating a configuration" on page 49.
> Start RMS.
Refer to the section "Starting RMS" on page 126. You can use the Cluster Admin GUI or the command line interface from any node in the cluster.
To avoid network access problems, perform RMS configuration tasks as root, and ensure that .rhosts and the rcp/rsh services are configured as described in the Installation Guide.
3.4 Creating and editing a configuration
You can bring up an existing wizard configuration that
97. [Figure 33: Successful consistency check for APP1]
The consistency check is successful; you can now use RMS to run APP1 with the mydemo configuration. Note that the wizard updated the display information for the scripts in items 6) StartCommands[0] and 7) StopCommands[0].
This completes the specification of the non-basic settings. You can now save the non-basic settings and exit this part of the configuration procedure.
> From the CommandLines menu (Figure 33), select SAVE+EXIT by entering the number 3.
This will take you back to the Settings of turnkey wizard "DEMO" menu (Figure 34).
  Consistency check ...
  Settings of turnkey wizard "DEMO"
  1) HELP                          11) RemoteFileSystems
  2) -                             12) IpAddresses
  3) SAVE+EXIT                     13) RawDisks
  4) -                             14) RC VolumeManagement
  5) ApplicationName=APP1          15) VERITAS VolumeManagement
  6) Machines+Basics (app1)        16) EMC RdfManagement
  7) CommandLines                  17) FibreCat MirrorView
  8) Controllers                   18) Gds Global Disk Services
  9) DEMO (Dem_APP1)               19) Gls Global Link Services
  10) LocalFileSystems
  Choose the setting to process: 3
Figure 34: Turnkey wizard DEMO
By s
98. Log level | Meaning
12 | Unused
13 | Token level
14 | Detector message
15 | Local queue level
16 | Local queue level
17 | Script level
18 | userApplication contract level
19 | Temporary debug traces
20 | SysNode traces
21 | Message level
22 | bm tracelog
Table 9: Log levels

You can also control logging with the RMS Wizard Tools or PCS:
• From the Wizard Tools Main configuration menu, select Configuration Edit Global Settings > DetectorDetails. The menu that appears will allow you to set the log level for detectors in the configuration.
• From any PCS window, select the configuration or any other item in the left-hand tree, then use Option > Trace and select the level of detail from the submenu (Figure 121).
[Figure 121: Controlling the log level with PCS]
7.7 Interpreting log files
Each process that makes up RMS generates three types of log messages: user, trace, and error. These log messages are contained in the following files:
switchlog | Records RMS events relevant to the user, such as switch requests and fault indications.
<program>log | Records trace messages or error mes
99. For example, if a node should fail to be available, the node that is to take its place must have been defined beforehand so that the applications depending on this node are able to continue operating with minimal interruption.

Once the necessary information is defined, you can then set up an RMS configuration. A configuration of this magnitude, however, requires a great deal of expert knowledge.

The RMS Wizards are tools that allow you to set up an RMS configuration in a way that is simple, flexible, and quality-tested. Furthermore, these tools conform to a well-documented, standard design. To configure RMS with the wizards, you supply information about the applications using a menu-driven interface. The wizards use this information to set up a complete RMS configuration.

The following sections describe these wizards and the way they are used to configure high availability from a general point of view.

3.1.1 RMS Wizard types

The RMS Wizards are divided into two categories:

• RMS Wizard Tools: These resource-oriented wizards provide scripts and detectors for basic resources such as file systems or IP addresses. The Wizard Tools also contain the GENERIC and DEMO application-oriented wizards.
• RMS Wizard Kit: These application-oriented wizards are designed to cover complete applications and perform their tasks on the basis of the turnkey concept. The R/3 and ORACLE wizards are
100. hvmod has been invoked without the -l option and the application is busy: some other modification is already in progress, some requests are being processed, or application contracting is ongoing.
Action: Reissue the hvmod command when the application has completed the current switch request.

• ADC 27 Dynamic modification failed.
Dynamic modification has failed. The exact reason for the failure is displayed in the message preceding this one.
Action: Check the error messages occurring in the switchlog or prior to this message to find out the exact cause of the failure.

• ADC 30 HV_WAIT_CONFIG value <seconds> is incorrect, using 120 instead.
If the value of the environment variable HV_WAIT_CONFIG is 0 or has not been set, the default value of 120 is used instead.
Action: Set the value of HV_WAIT_CONFIG in /opt/SMAW/SMAWRrms/bin/hvenv.

• ADC 31 Cannot get the NET_SEND_Q queue.
RMS uses the NET_SEND_Q queue for transmitting contract information. If there is some problem with this queue, the operation is aborted. The operation can be any one of the following: hvrcp, hvcopy.
Action: Contact field support.

• ADC 32 Message send failed during the file copy of file <filename>.
An error occurred while transferring file <filename> across the network.
Action: Check if there are any problems with the network.

• ADC 33 Dynam
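The fallback described for message ADC 30 (a value of 0 or an unset HV_WAIT_CONFIG results in the default of 120) can be sketched as follows. This is a minimal Python illustration, not RMS code: the function name is hypothetical, and treating negative or non-numeric values like 0 is an assumption that the text above does not confirm.

```python
DEFAULT_WAIT_CONFIG = 120  # default named in the ADC 30 message text


def effective_wait_config(env: dict) -> int:
    """Return the HV_WAIT_CONFIG value that would be used.

    Documented: 0 or unset -> 120.
    Assumption (not stated in the manual): negative or non-numeric
    values are also replaced by the default.
    """
    raw = env.get("HV_WAIT_CONFIG")
    if raw is None:
        return DEFAULT_WAIT_CONFIG
    try:
        value = int(raw)
    except ValueError:
        return DEFAULT_WAIT_CONFIG
    return value if value > 0 else DEFAULT_WAIT_CONFIG


if __name__ == "__main__":
    print(effective_wait_config({}))
    print(effective_wait_config({"HV_WAIT_CONFIG": "0"}))
    print(effective_wait_config({"HV_WAIT_CONFIG": "300"}))
```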
101. If there is no CIP entry corresponding to the SysNode <sysnode> in /etc/cip.cf, this message is the result and hvdet_node exits with exit code 139.
Action: Make sure that there is a corresponding CIP entry for the SysNode <sysnode> in /etc/cip.cf.

• NOD 16 detector failed to get information about RMS base monitor bm
When the detector hvdet_node finds that the RMS base monitor is not running, it exits with exit code 142.
Action: This might be due to the fact that hvdet_node has been started independently of RMS.

• NOD 17 Failed to set up SIGCHLD handler
Action: Critical error. Contact field support.

• NOD 18 Can't fork child hvdet_node
Action: Critical error. Contact field support.

• NOD 20 detector Cannot create socket errorreason
If there is a problem in the creation of an endpoint for communication between the detectors <detector> on the different hosts in the cluster, it manifests itself as a message in the switchlog and the detector exits with the exit code 111.
Action: Contact field support.

• NOD 21 detector Failed to bind address to socket error reason
If there is a problem in binding the endpoint of communication between the detectors <detector> on the different hosts in the cluster to a particular port, the result is this message, with <errorreason> indicating the reason for this erro
102. 15.12 SCON

scon
    start the cluster console software

15.13 SF

System administration

rcsd
    Shutdown Daemon of the Shutdown Facility
rcsd.cfg
    configuration file for the Shutdown Daemon
SA_blade.cfg
    configuration file for FSC server blade Shutdown Agent
SA_rccu.cfg
    configuration file for RCCU Shutdown Agent
SA_rps.cfg
    configuration file for Remote Power Switch Shutdown Agent
SA_rsb.cfg
    configuration file for RemoteView Services Board Shutdown Agent
SA_scon.cfg
    configuration file for SCON Shutdown Agent
SA_pprci.cfg
    configuration file for RCI Shutdown Agent (PRIMEPOWER only)
SA_sspint.cfg
    configuration file for Sun E10000 Shutdown Agent
SA_sunF.cfg
    configuration file for sunF system controller Shutdown Agent
SA_wtinps.cfg
    configuration file for WTI NPS Shutdown Agent
sdtool
    interface tool for the Shutdown Daemon

15.14 SIS

System administration

dtcpadmin
    start the SIS administration utility
dtcpd
    start the SIS daemon for configuring VIPs
dtcpstat
    status information about SIS

15.15 Web-Based Admin View

System administration

fjsvwvbs
    stop Web-Based Admin View
fjsvwvcnf
    start, stop, or restart the web server for Web-Based Admin View
wvCntl
    start, stop, or get debugging information for Web-Based Admin View
wvGetparam
    display Web-Based Admin View's envi
103. [Figure 19: Add hosts to a cluster menu. Items shown include 5) ALL-CF-HOSTS, 6) fuji2RMS and 7) fuji3RMS.]

This menu displays the current set of nodes and lists the machines that can be selected. If you select 5) ALL-CF-HOSTS, the RMS Wizards add all nodes in /etc/cip.cf or /etc/hosts to this configuration. Otherwise, you can add hosts individually from the displayed list.

> Select fuji2RMS by entering the number 6. Select fuji3RMS by entering the number 7 (see Figure 19).

At this screen, you can also choose 4) FREECHOICE, which will allow you to enter host names that are not listed in the menu.

> After all primary host names have been added, use 3) RETURN to return to the Main configuration menu.

Note: By default, these host names are of the form <machinename>RMS to follow the RMS naming convention. To override the default RMS name for a machine, modify that machine's hvenv.local file and set the RELIANT_HOSTNAME variable to the desired name. This must be done before you add the machine to the cluster in this step.

To remove a node, select 17) RMS-RemoveMachine from the Main configuration menu. The Remove hosts from a cluster menu appears (Figure 20).

[Figure 20: Remove hosts from a cluster menu. Current set: fuji2RMS, fuji3RMS. Items shown: 1) HELP, 2) QUIT, 3) RETURN, 4) ALL, 5) fuji2RMS, 6) fuji3RMS.]
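The naming rule in the note above (default name is the machine name followed by RMS, unless RELIANT_HOSTNAME is set in that machine's hvenv.local) can be expressed as a tiny function. This Python sketch only models the rule; `rms_node_name` and the dictionary standing in for hvenv.local are hypothetical, and the sketch does not read or parse the real file.

```python
from typing import Optional


def rms_node_name(machine_name: str, hvenv_local: Optional[dict] = None) -> str:
    """Return the RMS name for a machine.

    hvenv_local is a hypothetical dict standing in for the variables
    set in that machine's hvenv.local file.
    """
    override = (hvenv_local or {}).get("RELIANT_HOSTNAME")
    if override:
        return override
    # Default convention described in the text: <machinename>RMS
    return f"{machine_name}RMS"


if __name__ == "__main__":
    print(rms_node_name("fuji2"))
    print(rms_node_name("fuji3", {"RELIANT_HOSTNAME": "node3"}))
```

For the hosts in this example, `rms_node_name("fuji2")` yields `fuji2RMS`, matching the names shown in the menu.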
104. [Figure 61: Main configuration menu]

> Select Configuration-Activate by entering the number 8.

No further input is required at this stage. As the Wizard completes each task in the activation phase, it displays status information as described in the section "Activating a configuration" on page 49. You will be prompted to continue at the end of the process (see Figure 54).

The new configuration was distributed successfully.
About to put the new configuration in effect ... done.
The activation has finished successfully.
Hit CR to continue

Figure 62: Activating the configuration for the second time

> Press the Enter or Return key to return to the Main configuration menu (Figure 63).

[Screen: fuji2 Main configuration menu, current configuration mydemo, "No RMS active in the cluster".]
105. 65 taking offline 138 viewing logs 146 ApplicationSequence attribute 325 attributes Affiliation 335 Alternatelp 325 ApplicationSequence 325 AutoRecover 326 AutoRecoverCleanup 335 AutoStartUp 326 AutoSwitchOver 326 Class 335 ClusterExclusive 326 Comment 335 ControlledShutdown 336 DetectorStartScript 336 FaultScript 327 Follow 327 Halt 328 HostName 336 _List 328 IgnoreOfflineRequest 336 IgnoreOnlineRequest 337 IgnoreStandbyRequest 337 IndependentSwitch 337 LieOffline 338 MaxControllers 328 MonitorOnly 328 NoDisplay 338 NullDetector 338 OfflineDoneScript 338 OfflineScript 328 OnlinePriority 329 OnlineScript 329 OnlineTimeout 338 PartialCluster 330 PersistentFault 339 PostOfflineScript 330 PostOnlineScript 330 PreCheckScript 339 PreOfflineScript 330 PreOnlineScript 331 PreserveState 331 PriorityList 331 Resource 339 rName 339 Scalable 331 ScriptTimeout 332 ShutdownPriority 332 SplitRequest 340 StandbyCapable 332 U42117 J Z100 4 76 387 Index StandbyTimeout 333 StateChangeScript 333 WarningScript 334 AutoRecover fault processing 162 AutoRecover attribute 326 AutoRecoverCleanup attribute 335 AutoStartUp attribute 326 AutoSwitchOver fault processing 160 AutoSwitchOver attribute 326 B base monitor 20 debug messages 171 detectors 54 high availability 11 log file 172 log levels 182 messages 172 stack tracing 182 switchlog 186 basic settings wizards 47 bin directory 29
106. If a fault occurs within a resource used by a userApplication object, then only that userApplication can be switched to another node in the cluster. userApplication failover involves offline processing for the object on the first node, followed by online processing for the object on a second node.

There are also situations in which RMS requires a node to be shut down, or killed. In any case, before switching applications to a new node, RMS works together with the PRIMECLUSTER Shutdown Facility to guarantee that the original node is completely shut down. This helps to protect data integrity.

RMS also has the ability to recover a resource locally; that is, a faulted resource can be brought back to the online state without switching the entire userApplication to another cluster node.

2.2.3 Controlled applications and controller objects

In some situations, it is desirable for one application to control another in a parent-child relationship. Consider the scenario in Figure 2, in which a bank teller application depends on the network (represented by an Ipaddress subapplication) and a database application, which depends on a local file system (represented by an Fsystem subapplication). If either the network or the database fails in some way, the parent teller application cannot complete any transactions. Therefore, from the RMS perspective, the database a
107. Action: Set the PriorityList attribute of <appname> to include all the host names listed in the HostName attributes of the application's children. No duplicate host names should be present in the PriorityList.

• ADM 81 61 Dynamic modification failed: application <appname> may not have more than <maxcontroller> parent controllers as specified in its attribute MaxControllers.
If <appname> uses more parent controllers than specified by the attribute MaxControllers (<maxcontroller>), this message is the result and dynamic modification aborts.
Action: Make sure that the number of parent controllers used by an application does not exceed the number specified in the MaxControllers attribute, or modify MaxControllers to increase the number.

• ADM 82 62 Dynamic modification failed: cannot delete SysNode <sysnode> unless its state is one of Unknown, Wait, Offline or Faulted.
This message may appear in the switchlog if there is an attempt to delete a SysNode from a running configuration while this SysNode is not in one of the states Unknown, Offline, Wait, or Faulted.
Action: Shut down RMS on that host and do the deletion.

• ADM 83 63 Dynamic modification failed: cannot delete SysNode <sysnode> since this RMS monitor is running on this SysNode.
During dynamic modification, the local SysNode <sysnode> was going to be deleted.
Action:
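The PriorityList requirement stated in the first Action of this snippet (include every host name from the children's HostName attributes, with no duplicates) can be checked mechanically. The Python sketch below is a hypothetical pre-check for illustration; `check_priority_list` is not an RMS function, and the real validation is performed by RMS during dynamic modification.

```python
from typing import List


def check_priority_list(priority_list: List[str],
                        child_host_names: List[str]) -> List[str]:
    """Return a list of problems; an empty list means both rules hold."""
    problems = []
    seen = set()
    for host in priority_list:
        if host in seen:
            problems.append(f"duplicate host name in PriorityList: {host}")
        seen.add(host)
    for host in child_host_names:
        if host not in seen:
            problems.append(f"host name missing from PriorityList: {host}")
    return problems


if __name__ == "__main__":
    print(check_priority_list(["fuji2RMS", "fuji3RMS"],
                              ["fuji2RMS", "fuji3RMS"]))
    print(check_priority_list(["fuji2RMS", "fuji2RMS"],
                              ["fuji2RMS", "fuji3RMS"]))
```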
108. If there are both Faulted applications and applications that are not online anywhere, then the Faulted applications are shown above the ones that are not online anywhere.

[Figure 92: Faulted and offline applications in the clusterwide table]

If there is a split-brain condition in the cluster, then on both the clusterwide table and the RMS tree, colored exclamation marks will appear after the colored circles for SysNodes. A colored exclamation mark indicates that the state of that SysNode is different from what another SysNode views it as being. The color of the exclamation mark indicates the state that the other node thinks that the SysNode is in. If there are multiple nodes that see a SysNode in different states, you will see multiple exclamation marks after the colored circle. Exclamation marks are sorted according to the severity of the states.

Figure 93 shows a clusterwide table with an example of a split-brain condition.
109. Failover of controlled applications

If a child changes to an offline or faulted state, RMS may switch the parent, the child, and all the dependent resources to other nodes. The exact action depends on whether the controller has been configured to operate in Follow or Scalable mode, as discussed below.

2.2.3.1 Follow controllers

When a controller operates in Follow mode, the corresponding child application must always run on the same node as the parent; that is, if the parent is switched to another node, the Follow mode application and all its dependent resources will be switched there too. Likewise, if the child application fails in a way that requires it to be switched to another node, then the parent must be switched there as well. This is illustrated in Figure 4.

[Figure 4: Follow mode switchover]

Note the state of the Follow controller in Figure 4: like the child application, it is brought online only on the same node as the parent. Follow controllers can guarantee that a group of applications and their resources always run together on the same machine.

2.2.3.2 Scalable controllers

Scalable controllers allow the parent and child applications to run on separate machines. This not only allows more flexibility, but it may also prevent delays or outages when resources
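The Follow mode rule from section 2.2.3.1 (parent and child always end up on the same node, whichever of them triggers the switch) can be modeled with a toy data structure. This Python sketch is purely conceptual: the `App` class, `link_follow` and `switch` are hypothetical names, and the model ignores offline and online processing, dependent resources, and Scalable controllers.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class App:
    """Toy stand-in for an application linked by Follow controllers."""
    name: str
    node: str
    parent: Optional["App"] = field(default=None, repr=False, compare=False)
    follow_children: List["App"] = field(default_factory=list, repr=False,
                                         compare=False)


def link_follow(parent: App, child: App) -> None:
    """Record that child is controlled by parent in Follow mode."""
    child.parent = parent
    parent.follow_children.append(child)


def switch(app: App, target_node: str) -> None:
    """Switch app to target_node, moving its whole Follow group with it."""
    root = app
    while root.parent is not None:      # climb to the top-level parent
        root = root.parent
    stack = [root]
    while stack:                        # move every member of the group
        current = stack.pop()
        current.node = target_node
        stack.extend(current.follow_children)


if __name__ == "__main__":
    teller = App("teller", "fuji2RMS")
    database = App("database", "fuji2RMS")
    link_follow(teller, database)
    switch(database, "fuji3RMS")        # child switch drags the parent along
    print(teller.node, database.node)
```

Switching either the teller or the database moves both, which is the guarantee that Follow controllers are described as providing.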
110. CHECKSUM_INTERVAL seconds before.

Also, if a checksum from a remote node is not confirmed, or if the checksum is confirmed to be different, then the local monitor considers the remote node as Offline if that local monitor has been started more than HV_CHECKSUM_INTERVAL seconds before.

• HV_LOG_ACTION_THRESHOLD
Possible values: 0-100
Default: 98
Defines the behavior of hvlogcontrol. If the used space on the log disk is greater than or equal to this threshold, all subdirectories below log will be removed. Furthermore, if HV_LOG_ACTION is set to on and all subdirectories have already been removed, the actual log files will be removed too. Refer to the section "Local environment variables" on page 345 for more information on HV_LOG_ACTION.

• HV_LOG_WARN_THRESHOLD
Possible values: 0-100
Default: 95
Defines the behavior of hvlogcontrol. If the used space on the file system containing the RMS log disk is greater than or equal to this threshold value, the hvlogcontrol script issues a warning to the user regarding the large amount of log files.

• HV_LOH_INTERVAL
Possible values: 0-MAXINT
Default: 30
Minimum difference in seconds when comparing timestamps to determine the last online host for an application. The last online host (LOH) specifies the host where the userApplication was online most recently. It is determined if the OnlinePriority
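The interaction of HV_LOG_WARN_THRESHOLD, HV_LOG_ACTION_THRESHOLD and HV_LOG_ACTION described above can be summarized as a decision function. This Python sketch is a reading of the text, not the hvlogcontrol script itself: the function and action names are hypothetical, and the assumption that a warning is still issued once the action threshold is also reached is not confirmed by the manual.

```python
from typing import List


def hvlogcontrol_actions(used_percent: float,
                         warn_threshold: int = 95,      # HV_LOG_WARN_THRESHOLD default
                         action_threshold: int = 98,    # HV_LOG_ACTION_THRESHOLD default
                         log_action_on: bool = False,   # HV_LOG_ACTION set to on
                         subdirs_remaining: bool = True) -> List[str]:
    """Return the actions suggested by the documented thresholds."""
    actions = []
    if used_percent >= warn_threshold:
        actions.append("warn")
    if used_percent >= action_threshold:
        if subdirs_remaining:
            # First stage: remove the subdirectories below the log directory.
            actions.append("remove_log_subdirectories")
        elif log_action_on:
            # Second stage: only when HV_LOG_ACTION is on and no
            # subdirectories are left to remove.
            actions.append("remove_log_files")
    return actions


if __name__ == "__main__":
    print(hvlogcontrol_actions(90))
    print(hvlogcontrol_actions(96))
    print(hvlogcontrol_actions(99))
    print(hvlogcontrol_actions(99, log_action_on=True, subdirs_remaining=False))
```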
111. CRIPTS_TIME_OUT 348 ScriptTimeout attribute 332 sdtool 355 searching log text 107 secondary management server 92 secondary menus wizards 46 send clear fault request 163 severity levels Alert 180 Critical 180 Debug 180 Emergency 180 Error 180 Info 180 Notice 180 Warning 180 Shutdown Facility 12 ShutdownPriority attribute 332 SIS commands dtcpadmin 356 dtcpd 356 ditcpdbg 356 site preparation 34 software monitor function 1 RMS 10 SplitRequest attribute 340 Standby state 21 StandbyCapable attribute 332 StandbyTimeout attribute 333 U42117 J Z100 4 76 395 Index starting an application 134 RMS 126 startup file 130 state changes nodes 149 StateChangeScript script 23 StateChangeScript attribute 333 states 21 Deact 22 displaying information 119 Faulted 21 Inconsistent 22 Offline 21 OfflineFault 21 Online 21 Standby 21 Unknown 22 Wait 22 140 Warning 21 state triggered scripts FaultScript 23 OfflineDoneScript 23 PostOfflineScript 23 PostOnlineScript 23 WarningScript 23 stopping RMS 130 strace Linux trace tool 194 subapplication graph 111 subapplications 98 sub menus wizards 46 summary table 119 switch processing definition 165 fault situations 167 switching application to Sysnode 23 25 switching an application 136 switchlog 165 file 171 173 panel 104 viewing 103 145 switchlog error messages 195 SysNode 11 54 detector 164 fault 164 initiali
112. Cluster Interconnect Protocol

CLI     command line interface
CLM     Cluster Manager
CRM     Cluster Resource Management
DLPI    Data Link Provider Interface
ENS     Event Notification Services
GDS     Global Disk Services
GFS     Global File Services
GLS     Global Link Services
GUI     graphical user interface
HA      high availability
ICF     Internode Communication Facility
I/O     input/output
JOIN    cluster join services module
LAN     local area network
MDS     Meta Data Server
MIB     Management Information Base
MIPC    Mesh Interprocessor Communication
NIC     network interface card
NSM     Node State Monitor
OSD     operating system dependent
PAS     Parallel Application Services
PCS     PRIMECLUSTER Configuration Services
RCCU    Remote Console Control Unit
RCFS    PRIMECLUSTER File Share
RCI     Remote Cabinet Interface
RCVM    PRIMECLUSTER Volume Manager
RMS     Reliant Monitor Services
SA      Shutdown Agent
SAN     Storage Area Network
SCON    single console software
SD      Shutdown Daemon
SF      Shutdown Facility
SIS     Scalable Internet Services
VIP     Virtual Interface Provider

Figures

Figure 1    Overview of PRIMECLUSTER
Figure 2    Controlled application scenario
Figure 3    RMS representation o
113. Dynamic modification failed: sanity check did not pass for linked or unlinked objects.
Dynamic modification performs some sanity checks to ensure that all of the following are true:

- The HostName attribute is present only for children of userApplication objects.
- The child of a userApplication does not have another parent.
- Each object belongs to only one userApplication.
- Leaf objects have detectors.
- Leaf objects that have the DeviceName attribute have it set to a valid value.
- The length of the attribute rName for the leaf objects is smaller than the maximum.
- There are no duplicate lines in the hvgdstartup file.
- The kind argument for the detector in the hvgdstartup is specified.
- All detectors can be loaded.
- A valid value has been specified for the rKind attribute.
- The ScriptTimeout value is greater than the detector cycle time.
- No objects are "and" and "or" at the same time.
- ClusterExclusive and LieOffline, which are mutually exclusive, are not used together.

If some of these sanity checks fail, then this message will be printed and dynamic modification is aborted. A FATAL message is also printed to the switchlog with more details as to why the sanity check failed.
Action: Make sure that the sanity checks mentioned above pass.

• ADM 24 45 Dynamic modification failed: object <
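Three of the sanity checks listed above are simple enough to illustrate in code. The Python sketch below is an illustration under stated assumptions, not the checks RMS actually runs: the object is represented as a plain dictionary, and the keys `detector_cycle_time`, `is_and` and `is_or` are hypothetical names invented for this example (only ClusterExclusive, LieOffline and ScriptTimeout are attribute names taken from the text).

```python
from typing import List


def sanity_problems(obj: dict) -> List[str]:
    """Check three of the documented rules against one object description."""
    problems = []
    # ClusterExclusive and LieOffline are described as mutually exclusive.
    if obj.get("ClusterExclusive") and obj.get("LieOffline"):
        problems.append("ClusterExclusive and LieOffline are used together")
    # ScriptTimeout must be greater than the detector cycle time.
    timeout = obj.get("ScriptTimeout")
    cycle = obj.get("detector_cycle_time")   # hypothetical key
    if timeout is not None and cycle is not None and timeout <= cycle:
        problems.append("ScriptTimeout is not greater than the detector cycle time")
    # An object may not be "and" and "or" at the same time.
    if obj.get("is_and") and obj.get("is_or"):   # hypothetical keys
        problems.append("object is 'and' and 'or' at the same time")
    return problems


if __name__ == "__main__":
    print(sanity_problems({"ScriptTimeout": 300, "detector_cycle_time": 10}))
    print(sanity_problems({"ClusterExclusive": 1, "LieOffline": 1}))
```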
114. Figure 120 is an example of a log file search based on a severity level.

[Figure 120: Results of severity-level-based search, showing NOTICE entries from the app2 log on fuji2RMS]

7.5 Using the hvdump command

The hvdump command is used to get debugging information about RMS on the local node. Regardless of whether the base monitor is running on the local node, invoking hvdump causes it to gather PRIMECLUSTER product and system files that will be used to diagnose the problem. For a detailed list of the information gathered, consult the hvdump(1M) manual page.

7.6 Specifying the log level

For further debugging information, use the -l <level> option of the hvcm or hvutil commands to activate var
115. [Figure 14: Menu to configure non-basic settings]

3.5 Activating a configuration

As described in section "General configuration procedure" on page 40, activating a configuration is the third of the four fundamental steps required to set up a high availability configuration. The activation phase comprises a number of tasks, among which are generation and distribution of a configuration.

Note: You must stop RMS before you activate a configuration.

The starting point for the activation phase is the Main configuration menu (see Figure 15).

fuji2: Main configuration menu, current configuration: mydemo
No RMS active in the cluster
 1) HELP                     10) Configuration-Remove
 2) QUIT                     11) Configuration-Freeze
 3) Application-Create       12) Configuration-Thaw
 4) Application-Edit         13) Configuration-Edit-Global-Settings
 5) Application-Remove       14) Configuration-Consistency-Report
 6) Application-Clone        15) Configuration-ScriptExecution
 7) Configuration-Generate   16) RMS-CreateMachine
 8) Configuration-Activate   17) RMS-RemoveMachine
 9) Configuration-Copy
Choose an action: 8

Figure 15: Main configuration menu

> Select the Configuration-Activate item by entering the number 8.

The activation is performed by the wizard. No further input is requir
116. [Figure 29: Saving settings. The menu shows the machines and basic settings for the application, including Machines[0]=fuji2RMS and Machines[1]=fuji3RMS.]

Save your settings now to complete the Application-Create process.

> Select SAVE+EXIT by entering the number 3.

4.6 Entering non-basic settings

The DEMO turnkey wizard performs another consistency check before returning you to the wizard settings menu (Figure 30).

[Screen: Settings of turnkey wizard DEMO, with the header "Yet to do: process at least one of the non-basic settings".]
117. In most cases, it is not necessary to specify options to the hvcm command. The base monitor is the decision-making module of RMS. It controls the configuration and access to all RMS resources. If a resource fails, the base monitor analyzes the failure and initiates the appropriate action according to the specifications for the resource in the configuration file.

hvconfig
    Performs two tasks: displaying the current RMS configuration or sending the current configuration to an output file. The output of the hvconfig command is equivalent to the running RMS configuration file, but does not include any comments that are in the original file. Also, the order in which the resources are listed in the output might vary from the actual configuration file.
hvdisp
    Displays information about the current configuration for RMS resources. Does not require root privilege.
hvdist
    Distributes the configuration file to all nodes within an RMS configuration.
hvdump
    Gets debugging information about RMS on the local node.
hvgdmake
    Makes (compiles) a custom detector so that it can be used in the RMS configuration. The user first prepares a source file for the detector, which must be a file with a .c extension.

Table 1: Available CLI commands

hvlogclean
    Saves old log files into a subdirectory whose name is the time
118. LOG_ACTION_THRESHOLD, HV_LOG_WARNING_THRESHOLD, HV_WAIT_CONFIG or HV_RCSTART. This will eventually cause RMS to exit with exit code 1.
Action: Set the value of the environment variable to an appropriate value.

• ADC 17 <hostname> is not in the Wait state, hvutil -u request skipped.
When hvutil -u has been invoked on a node, if the SysNode for that node is not in the Wait state, then this message will appear (internal option).
Action: If the hvutil -u was issued prematurely, then reissue the command once the node has reached the Wait state.

• ADC 18 Local environment variable <envattribute> is not set in hvenv file.
If one of the local environment variables <envattribute> is not set in hvenv, this message is the result. <envattribute> can be any one of the following: SCRIPTS_TIME_OUT, RELIANT_INITSCRIPT, RELIANT_STARTUP_PATH, HV_CONNECT_TIMEOUT, HV_MAXPROC or HV_SYSLOG_USE. This will eventually cause RMS to exit with exit code 1.
Action: Set the value of <envattribute> to an appropriate value.

• ADC 20 <hostname> is not in the Wait state, hvutil -o request skipped.
The hvutil -o command has been invoked on a node, but its SysNode is not in the Wait state (internal option).
Action: The hvutil -o was issued prematurely. Reissue the command after the SysNode has reached the Wait state.

• ADC 25 Application <appname> is locked or busy, modification request skipped.
119. [Figure 30: Non-basic settings]

The menu header indicates there is at least one more setting to specify, but it is not a basic setting. As described earlier, this application creates an animated graphical picture on an X-window display. Therefore, a display setting for the DEMO wizard must be added to the basic settings you have already entered.

> Select DEMO by entering the number 9.

The CommandLines menu appears (Figure 31).

[Figure 31: Prompting for display specification. The header reads "Yet to do: set a display" and the status line reads "CommandLines (Dem_APP1: not yet consistent)". The menu items are HELP, SAVE+EXIT, REMOVE+EXIT, Display, StartCommands[0], StopCommands[0], CheckCommands[0], Timeout=300, AutoRecover=no and MonitorOnly=no.]

The menu header indicates that a display still needs to be specified, and the status line tells you that APP1 is not yet consistent; that is, APP1 could not yet run with the present mydemo configuration. Items in the menu body indicate which scripts the wizard provides for starting, stopping, and checking: see the lines beginning with 6) StartCommands[0], 7) StopCommands[0], and 8) C
120. Main configuration menu: Changes dynamically at run time depending on whether RMS is running in the cluster and whether the configuration being edited is the current configuration.

If RMS is running anywhere in the cluster, actions that could modify a running configuration are not available. Additionally, the menu items that are available are modified such that no changes can be made to the running configuration.

When RMS is running but the configuration being edited is not the same as the currently active one, the main menu is not restricted, except that the Configuration-Activate menu option is not available.

3.4.2.1 Main configuration menu when RMS is not active

If RMS is not running anywhere, then the entire top-level menu is presented without restrictions. Figure 9 shows the Main configuration menu window when RMS is inactive.

fuji2: Main configuration menu, current configuration: mydemo
No RMS active in the cluster
 1) HELP                     10) Configuration-Remove
 2) QUIT                     11) Configuration-Freeze
 3) Application-Create       12) Configuration-Thaw
 4) Application-Edit         13) Configuration-Edit-Global-Settings
 5) Application-Remove       14) Configuration-Consistency-Report
 6) Application-Clone        15) Configuration-ScriptExecution
 7) Configuration-Generate   16) RMS-CreateMachine
 8) Configuration-Activate   17) RMS-RemoveMachine
 9) Configuration-Copy
Choose an action:

Figure 9: Main configuration menu when RMS is not active
121. [Figure 64: Invoking the Cluster Admin GUI]

5.2.2 Logging in

After the Web-Based Admin View login screen appears (Figure 65), log in as follows:

> Enter the user name and password for a user with the appropriate privilege level.
> Click on the OK button.

[Figure 65: Web-Based Admin View login screen]

Use the appropriate privilege level while logging in. Cluster Admin has the following privilege levels:

• Root privileges: Can perform all actions, including configuration, administration, and viewing tasks.
• Administrative privileges: Can view and execute commands, but cannot make configuration changes.
• Operator privileges: Can only perform viewing tasks.

For more details on the privilege levels, refer to the PRIMECLUSTER Installation Guide (Solaris, Linux).

After clicking on the OK button, the top menu appears (Figure 66).
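The three Cluster Admin privilege levels listed above amount to a small permission table. The following Python sketch only restates that table; the enum, the action names ("view", "execute", "configure") and the `can` helper are hypothetical and do not correspond to any Cluster Admin or Web-Based Admin View API.

```python
from enum import Enum


class Privilege(Enum):
    OPERATOR = "operator"
    ADMINISTRATIVE = "administrative"
    ROOT = "root"


# Action names are illustrative labels for the tasks named in the text.
ALLOWED_ACTIONS = {
    Privilege.ROOT: {"view", "execute", "configure"},
    Privilege.ADMINISTRATIVE: {"view", "execute"},
    Privilege.OPERATOR: {"view"},
}


def can(privilege: Privilege, action: str) -> bool:
    """Return True if the privilege level permits the given action."""
    return action in ALLOWED_ACTIONS[privilege]


if __name__ == "__main__":
    for level in Privilege:
        print(level.value, sorted(ALLOWED_ACTIONS[level]))
```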
122. [Figure 117: Resource-based search]

7.4.2 Search based on time

Search the log files based on the date and time range as follows:

1. Specify the start and end times for the search range.
2. Click on Enable.
3. Press the Filter button.

Figure 118 shows the results for a search based on the time filter.

[Figure 118: Results of time-based search]

7.4.3 Search based on keyword

Search the log files based on a keyword as follows:

1. Enter a keyword.
2. Click on the Filter button.

Figure 119 shows an example of a log file search ba
123. Figure 13: Menu to configure basic settings. The menu offers HELP, SAVE+EXIT, REMOVE+EXIT, AdditionalMachine, and AdditionalConsole, shows a Machines entry for fuji2RMS, and lists the script settings PreCheckScript, PreOnlineScript, PostOnlineScript, PreOfflineScript, OfflineDoneScript, and FaultScript. The attribute settings shown are AutoStartUp=no, AutoSwitchOver=No, PreserveState=no, PersistentFault=0, LicenseToKill=no, AutoBreak=yes, HaltFlag=no, and PartialCluster=0; ShutdownPriority, OnlinePriority, StandbyTransitions, and ScriptTimeout are shown without values.

After you complete the configuration of the basic settings, the non-basic settings menu appears (Figure 14). Non-basic settings include specifications for resources such as file systems, IP addresses, disks, and so forth.

The menu for the DEMO turnkey wizard reports "Consistency check ... Yet to do: process at least one of the non-basic settings" and lists entries including ApplicationName=APP1, Machines+Basics (app1), CommandLines, Controllers, DEMO, RemoteFileSystems, IpAddresses, RawDisks, RC-VolumeManagement, VERITAS-VolumeManagement, EMC-RdfManagement, FibreCat-MirrorView, Gds:Global-Disk-Services, and Gls:Global-Link-Services.
124. PRIMECLUSTER Reliant Monitor Services (RMS) with Wizard Tools (Solaris, Linux): Configuration and Administration Guide. Edition December 2003.

Comments, Suggestions, Corrections
The User Documentation Department would like to know your opinion of this manual. Your feedback helps us optimize our documentation to suit your individual needs. Fax forms for sending us your comments are included in the back of the manual. There you will also find the addresses of the relevant User Documentation Department.

Certified documentation according to DIN EN ISO 9001:2000
To ensure a consistently high quality standard and user-friendliness, this documentation was created to meet the regulations of a quality management system which complies with the requirements of the standard DIN EN ISO 9001:2000. cognitas. Gesellschaft für Technik-Dokumentation mbH, www.cognitas.de

Copyright and Trademarks
Copyright 2002, 2003 Fujitsu Siemens Computers Inc. and Fujitsu LIMITED. All rights reserved. Delivery subject to availability; right of technical modifications reserved. Solaris and Java are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States and other countries. Linux is a registered trademark of Linus Torvalds. All other hardware and software names used are trademarks of their respective companies. This manual is printed on paper treated with chlorine-free bleach.
125. Parentheses: Enclose items that must be grouped together when repeated.
Ellipsis: Signifies an item that may be repeated. If a group of items can be repeated, the group is enclosed in parentheses.

1.4 Important notes and cautions

Material of particular interest is preceded by the following symbols in this manual:
i: Contains important information about the subject at hand.
Caution: Indicates a situation that can cause harm to data.

2 Introduction

This chapter contains general information on Reliant Monitor Services (RMS), introduces the PRIMECLUSTER family of products, details how RMS, RMS Wizard Tools, and the RMS Wizard Kit work together to produce high-availability configurations, and introduces Cluster Admin.

This chapter discusses the following:
• The section "PRIMECLUSTER overview" describes how RMS functions within the PRIMECLUSTER family of products.
• The section "How RMS provides high availability" describes how RMS supplies high availability.
• The section "How RMS Wizards provide easy configuration" details the RMS Wizard products, RMS Wizard Tools and RMS Wizard Kit.
• The section "Cluster Admin" introduces the Cluster Admin graphical user interface (GUI).
• The
126. RMS was last started unless the -d option is used to delete the old log files instead. Regardless, hvlogclean creates a clean set of log files even while RMS is running.

hvrclev
Sets the RMS default start run level to 3 to allow for the system processes started in the remote file sharing state, as well as any user application resources started in run level 3. The hvrclev command can be used to reset the RMS default start run level back to the original run level 2. The hvrclev command is typically called from pkgadd to automatically adjust the RMS start run level for those customers who have a default system run level of 3.

hvreset
Reinitializes the graph of an RMS user application on one or more nodes in the configuration. Running scripts will be terminated, ongoing requests and contracts will be cleaned up, and the entire graph will be brought back into a consistent initial state. This command is intended for use by experts only.

hvshut
Shuts down RMS on one or more nodes in the configuration. The base monitor on the local node sends a message to other online nodes indicating which node or nodes will be shut down.

hvswitch
Manually switches control of a user application resource from one system node to another in the RMS configuration. The resource being switched must be of type userApplication. The system node must be of type SysNode.

hvthrottle
Prevents multiple scripts within a configuration file from ru
127. Figure 104: Using command pop-up to stop RMS

CLI

The syntax for the CLI is as follows:

hvshut {-f | -L | -a | -l | -s nodename}

The hvshut command shuts down the RMS software on one or more nodes in the configuration. The base monitor on the local node sends a message to other online nodes indicating which node or nodes are to be shut down. The hvshut command disables all error detection and recovery routines on the nodes being shut down, but does not shut down the operating system. If any userApplication objects are online when the -f or -L options are used, the applications remain running but are no longer monitored by RMS.

The -L option does a forced shutdown of RMS without shutting down the application. The -f option does an emergency shutdown of RMS. Both options only affect the local node, but the -f option is for emergencies when other hvshut options do not work.

Caution: Use the -f and -L options carefully, as they could result in inconsistencies or data corruption.

5.3.3 Starting an application

Bring an application online as fo
128. This field may be empty if no resource is associated with the message.

The state field is an indication of the type of action that is being performed and is the value as set by RMS in the environment variable HV_SCRIPT_TYPE. The field typically contains the values online or offline. The RMS Wizards also set the field with the value PreCheck when a PreCheck script is being run. This field will be empty for messages of type DEBUG.

The timestamp field contains the date and time when the message occurred and is written in the format yyyy mm dd hh mm ss, where
yyyy is the 4-digit year,
mm is the month number,
dd is the day of the month,
hh is the hour in the range of 0-23,
mm is the minute of the hour,
ss is the number of seconds past the minute.

Message type is defined as one of the following:
• DEBUG
• NOTICE
• WARNING
• ERROR
• FATAL ERROR

Messages are any text generated by the RMS Wizard product. This text can contain one or more new lines.

The delimiter is defined as a series of four equal signs (====).

Debug messages from scripts that are run can be forced by setting the environment variable HV_SCRIPTS_DEBUG to 1 in the hvenv.local file. The entry should appear as follows:

export HV_SCRIPTS_DEBUG=1

To turn off debug output, either remove the HV_SCRIPTS_DEBUG entry from the hvenv.local file, comment it out, or set the value to 0.

When debugging problems t
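The three ways of turning debug output off (removing the entry, commenting it out, or setting it to 0) can be sketched as a small check over the text of an hvenv.local file. This is only an illustration: the function name `scripts_debug_enabled` is hypothetical, and treating any value other than 1 as "off" is an assumption, since the manual only describes the values 1 and 0.

```python
def scripts_debug_enabled(hvenv_local_text: str) -> bool:
    """Return True if the text enables wizard script debug output.

    Hypothetical helper, not part of RMS. Assumption: only the value 1
    switches debug output on; the manual describes 1 (on) and 0 (off).
    """
    enabled = False
    for raw_line in hvenv_local_text.splitlines():
        line = raw_line.strip()
        # A removed entry simply never appears; a commented-out entry is skipped.
        if not line or line.startswith("#"):
            continue
        if line.startswith("export "):
            line = line[len("export "):].strip()
        name, sep, value = line.partition("=")
        if sep and name.strip() == "HV_SCRIPTS_DEBUG":
            # The last assignment in the file wins, as in a shell script.
            enabled = value.strip().strip("\"'") == "1"
    return enabled


if __name__ == "__main__":
    print(scripts_debug_enabled("export HV_SCRIPTS_DEBUG=1"))
    print(scripts_debug_enabled("# export HV_SCRIPTS_DEBUG=1"))
```

The check reads the file text only; it does not change the file or start any RMS command.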
129. Figure 60: Main configuration menu (fuji2, current configuration: mydemo, no RMS active in the cluster). The menu offers HELP, QUIT, Application-Create, Application-Edit, Application-Remove, Application-Clone, Configuration-Generate, Configuration-Activate, Configuration-Copy, Configuration-Remove, Configuration-Freeze, Configuration-Thaw, Configuration-Edit-Global-Settings, Configuration-Consistency-Report, Configuration-ScriptExecution, RMS-CreateMachine, and RMS-RemoveMachine.

This completes the creation of the second application.

4.13 Activating the configuration a second time

After returning to the Main configuration menu, you must activate the mydemo configuration for the second time. This has to be done because you have modified the configuration by adding another application.

RMS cannot be running while you activate a configuration. In this example, we stopped RMS before creating the second application.

To activate the configuration, begin at the Main configuration menu (Figure 61), which shows the same entries as Figure 60.
130. Verify that sdtool and Shutdown Facility are operating properly.

9.12 SYS: SysNode objects

• SYS 33: The RMS cluster host <hostname> does not have a valid entry in the /etc/hosts file. The lookup function gethostbyname failed. Please change the name of the host to a valid /etc/hosts entry and then restart RMS.
If the lookup function gethostbyname, which searches the file /etc/hosts to get information about the host hostname, is unable to find a valid entry for it, this message is printed and RMS exits with exit code 114.
Action: Make sure that the host name hostname has a valid entry in /etc/hosts and restart RMS.

• SYS 52: SysNode sysnode: error creating necessary message queue NODE_REQ_Q, exiting.
When RMS encounters a problem in creating the NODE_REQ_Q, this message is the result and RMS exits with exit code 12.
Action: Contact field support.

9.13 UAP: userApplication objects

• UAP 36: object: double fault occurred, but Halt attribute is set. RMS will exit immediately in order to allow a failover.
When the Halt attribute is set for an object and a double fault occurs, RMS will exit with code 96 on that node.
Action: Contact field support.

9.14 US: us files

• US 1: RMS will not start up: fatal errors in configuration file.
Errors were found in the configuration file that prevented RMS startup. This is usual
131. Figure 22: Application type selection menu

This example uses the DEMO application type, which has been designed to familiarize the user with the configuration process and is intended for demonstration purposes only: other than a few user-specified attributes, everything is preset and ready to run. To configure a real-world application, you would instead select the GENERIC application type, as described in the section "Creating a second application".

> Select the DEMO application type by entering the number 5.

You have now assigned the DEMO application type to your application. This means the DEMO turnkey wizard will provide the application with scripts and detectors that were developed for this application type.

There are, however, more parameters to specify before this application can run. One of them might be the application name: you can assign a name of your choice to any application that you configure for RMS. In this case, there is no need to specify an application name, as the DEMO wizard provides APP as a default here.

APP is a simple application, developed specifically for this example, that generates an animated graphical figure on an X-window display. It will be used to demonstrate how an application can be started, stopped, or switched, and how RMS performs failover when the application process is killed on the initial node.

After performing a consisten
132. Figure 58: Indication of flags set for controller

Note that your settings are confirmed on item 7, Controllers[0]: the A and T flags have been set for APP1.

> Select SAVE+EXIT by entering the number 3.

This takes you back to the GENERIC menu (Figure 59).

Figure 59: Menu with settings for GENERIC turnkey wizard

In the GENERIC menu, item 8 (Controllers) now displays a controller assigned to APP2.

> Select SAVE+EXIT by entering the number 3.

This takes you back to the Main configuration menu (Figure 60).
133. a new resource <resource> with the name of an existing resource, it prints out this message and dynamic modification aborts.
Action: Make sure that when adding a new resource, its name does not match the name of any other existing resource.

• ADM 8 29: Dynamic modification failed: cycle of length <cyclelength> detected in resource <resource> <cycle>.
In the overall structure of the graph of the RMS resources, no cycles are allowed along the chains of parent-child links. If a cycle exists, dynamic modification fails and the message specified above is printed to the switchlog.
Action: Get rid of the cycles.

• ADM 9 34: Dynamic modification failed: cannot modify resource <resource> since it is going to be deleted.
Deleting a resource also deletes all of its children that have no other parents. Deleting a resource and then modifying the attributes of the deleted resource, or of a child of that resource that has no other parents, therefore causes dynamic modification to abort, and the message is printed to the switchlog.
Action: While performing dynamic modification of a resource, make sure that the resource that is being modified has not been deleted.

• ADM 11 37: Dynamic modification failed: cannot delete object <resource> since it is a descendant of another obje
134. age is printed and hvassert exits with exit code 1.
Action: Make sure that the remote host is Online before performing hvassert.

• Remote host does not exist: host.
If the SysNode specified as hvassert -h host is not part of the RMS resource graph, this message is printed and hvassert exits with exit code 10.
Action: Make sure that the remote hostname specified for hvassert exists.

• Remote system is not online.
Trying to perform hvassert on an object on a remote host that does not have RMS running causes this message, and hvassert exits with exit code 10.
Action: Make sure that the remote system has RMS running before performing hvassert.

• Reset of RMS has been aborted.
When the user invokes hvreset, the hvreset utility asks for a confirmation. If the answer is not yes, hvreset is aborted and this message is printed out.
Action: None required.

• Resource does not exist: resource.
If there is an attempt to perform hvassert on a resource that is not part of the RMS resource graph, this message is printed and hvassert exits with exit code 10.
Action: Make sure that a resource exists before performing hvassert on it.

• RMS has failed to start!
hvcm has been invoked without specifying a configuration with the -c option, but with other command line options specified. This may cause ambiguit
135. ailed: cannot link to or unlink from an application <appname>.
If the parent of the resource is a userApplication, linking a child to that parent or unlinking a child from it is not possible. If this is attempted, the above message is printed to the switchlog and dynamic modification is aborted.
Action: Do not link or unlink a resource from a userApplication.

• ADM 15 41: Dynamic modification failed: parent object <parentobject> is not a resource.
When RMS gets a directive to link existing resources during dynamic modification, if the parent object <parentobject> to which the child object is being linked is not a resource, dynamic modification fails and this message is printed.
Action: Make sure that while linking two objects, the parent of the child object is a resource.

• ADM 16 42: Dynamic modification failed: child object <childobject> is not a resource.
When RMS gets a directive to link existing resources during dynamic modification, if the child object <childobject> that is being linked to a parent object is not a resource, dynamic modification fails and this message is printed.
Action: Make sure that while linking two objects, the child of the parent object is a resource.

• ADM 17 43: Dynamic modification failed: cannot link parent <parentobject> and
136. al circumstances the local host should always be present in the list. If this is not the case, this message is the result.
Action: Contact field support.

• CRT 2: cannot obtain the NET_SEND_Q queue.
RMS uses internal queues for sending contracts. Contracts are messages that are transmitted between the hosts in a cluster and that keep the hosts synchronized with respect to a particular operation, whether the processes involved are on the same host or on different hosts. If there is a problem with the queue NET_SEND_Q that is used to transmit these contracts from one host to another in the RMS cluster, it manifests itself as this message in the switchlog.
Action: Contact field support.

• CRT 3: Message send failed.
This message results when RMS tries to send a message to another host in the cluster and delivery over the queue NET_SEND_Q fails. The cause may be that the receiving host has gone down or that there is a problem with the cluster interconnect.
Action: Check that the other hosts in the cluster are all alive and that none of them is experiencing network problems.

• CRT 4: Contract retransmit failed: Message Id messageid, see bmlog for contract details.
When RMS on one host sends a contract to another host (or to itself, if there is only one host in the cluster) over the queue NET_SEND_Q, it tries to t
137. al description 32
GENERIC turnkey 80
hvexec command 53
main menu 42
non-basic settings 47
ORACLE 33
R/3 33
resource wizards 33
secondary menus 46
sub-menus 46
turn off debug output 189
turnkey 32, 43, 55
wizards, log messages 187
wvCntl 356
wvGetparam 356
wvSetparam 356
wvstat 356

Comments, Suggestions, Corrections

Fujitsu Siemens Computers GmbH, User Documentation, 33094 Paderborn, Germany
Fax: 49 700 372 00001
email: manuals@fujitsu-siemens.com
http://manuals.fujitsu-siemens.com

Submitted by

Comments on PRIMECLUSTER Reliant Monitor Services (RMS) with Wizard Tools (Solaris, Linux), Configuration and Administration Guide
138. ally switches control of a userApplication resource from one system node to another in the RMS configuration. The resource being switched must be of type userApplication. The system node must be of type SysNode. The -f option is a forced switch option.

Caution: Use the -f option carefully, as it could result in inconsistencies or data corruption.

5.3.4 Switching an application

Switch an online application as follows:
1. Right-click on the application object and select the Switch menu option. A pull-down menu appears, listing the available nodes for switchover.
2. Select the target node from the pull-down menu to switch the application to that node (Figure 106).
139. an the maximum possible loglevel maxloglevel, this message is the result and RMS exits with exit code 4.
Action: Specify a loglevel between 1 and maxloglevel for hvcm -l.

• CML 19: Invalid range <low-high>. Within the -l option, the end range value must be larger than the first one.
When a range of loglevels has been specified with the -l option for hvcm, if the value of the end of the range (high) is smaller than the value of low, this message appears and RMS exits with exit code 4.
Action: Specify an end range value that is higher than the initial range value.

• CML 20: Log level must be numeric.
If the log level specified with the -l option for hvcm is not a number, this message is the result and RMS exits with exit code 4.
Action: Specify a numeric value for the log level.

• CML 21: 0 is an invalid range value. 0 implies all values. If a range is desired, the valid range is 1..maxloglevel with the -l option.
If the log level specified with the -l option of hvcm is outside the valid range, this message is printed and RMS exits with exit code 4.
Action: The valid range for the -l option of hvcm is 1..maxloglevel.

8.6 CRT: Contracts and contract jobs

• CRT 1: FindNextHost: local host not found in priority list of nodename.
The RMS base monitor maintains a priority list of all the hosts in the cluster. Under norm
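The log-level checks behind the CML messages can be summarized in a short sketch. It is illustrative only and is not the hvcm implementation: the function name `parse_loglevel_spec`, the `max_level` parameter, and the `low-high` spelling of a range are assumptions, and an end value equal to the start value is rejected here because the message says the end "must be larger".

```python
from typing import List


def parse_loglevel_spec(spec: str, max_level: int) -> List[int]:
    """Validate a log-level argument in the spirit of the CML messages.

    Hypothetical sketch: accepts either one number or a "low-high" range
    (the range separator is an assumption) and returns the selected levels.
    """
    low_text, sep, high_text = spec.partition("-")
    parts = [low_text, high_text] if sep else [low_text]
    if not all(part.strip().isdecimal() for part in parts):
        raise ValueError("CML 20: Log level must be numeric")
    values = [int(part) for part in parts]
    if any(value > max_level for value in values):
        raise ValueError("log level exceeds the maximum possible loglevel")
    if not sep:
        # A single 0 implies all values.
        if values[0] == 0:
            return list(range(1, max_level + 1))
        return values
    low, high = values
    if low == 0 or high == 0:
        raise ValueError("CML 21: 0 is an invalid range value")
    if high <= low:
        raise ValueError("CML 19: Invalid range")
    return list(range(low, high + 1))


if __name__ == "__main__":
    print(parse_loglevel_spec("2-4", max_level=5))
```

Raising an exception stands in for the behaviour the manual describes, where the message is printed and RMS exits with exit code 4.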
140. any circular dependencies. For example, if A depends on B, and B depends on C, then C cannot depend on A.

6.1.2 Resource description

The configuration wizards generate descriptions for each application's required resources. The descriptions include the following:
• What action occurs if the state of a resource changes
• How RMS should configure or de-configure a resource
• What interdependencies exist between the resources

The configuration file uses a typical RMS meta-language and has the following characteristics:
• Objects represent resources.
• Parent-child relationships between objects represent interdependencies between resources.
• Object attributes represent the properties of the resources and the actions that are required for specific resources.

Upon startup, RMS interprets the configuration file and distributes the information to all cluster nodes.

6.1.3 Messages

In RMS, objects exchange messages with the following:
• Detectors
• Command interface
• GUI
• Other objects

Objects exchange this data for the following purposes:
• To send requests
• To communicate changes in the object states

In general, objects communicate only with their direct parents and children. RMS sends incoming external requests to the userApplication object and then forwards the requests to the children. The userApplication can also generate its own reques
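The rule against circular dependencies (A depends on B, B depends on C, so C cannot depend on A) can be checked with a depth-first search over the dependency links. The sketch below is a generic illustration, not the check that RMS or the wizards perform; the function name `find_cycle` and the dictionary representation of the graph are assumptions.

```python
from typing import Dict, List, Optional


def find_cycle(depends_on: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of names, or None if acyclic.

    Hypothetical helper: depends_on maps each object to the objects it
    depends on. Uses depth-first search with three visit states.
    """
    UNVISITED, IN_PROGRESS, DONE = 0, 1, 2
    state: Dict[str, int] = {}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        state[node] = IN_PROGRESS
        path.append(node)
        for child in depends_on.get(node, []):
            child_state = state.get(child, UNVISITED)
            if child_state == IN_PROGRESS:
                # The child is already on the current path: a cycle closes here.
                return path[path.index(child):] + [child]
            if child_state == UNVISITED:
                found = visit(child)
                if found:
                    return found
        path.pop()
        state[node] = DONE
        return None

    for node in list(depends_on):
        if state.get(node, UNVISITED) == UNVISITED:
            found = visit(node)
            if found:
                return found
    return None


if __name__ == "__main__":
    print(find_cycle({"A": ["B"], "B": ["C"], "C": ["A"]}))
```

With the example from the text, adding the link from C back to A makes the function report the chain A, B, C, A; without that link it returns None.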
141. application on the local host must be set to a defined Offline state. The procedure is the same as that described under offline processing.

2. When offline processing is successfully completed, an online request is sent to the corresponding userApplication of a remote host (see the section "Switch processing"). However, the userApplication is now in the Faulted state, unlike the situation with a normal offline request. This prevents the possibility of an application returning to the host in the event of another switchover.

If a further fault occurs during offline processing, for example if RMS cannot deconfigure the resource of a node that was notified of a Faulted state, then it does not execute a switchover procedure. RMS does not execute a switchover because it views the resources as being in an undefined state. The userApplication does not initiate any further actions and blocks all external, non-forced requests.

A failure during offline processing is called a double fault. A double fault causes the machine to be eliminated if the userApplication halt flag is set.

This situation cannot be resolved by RMS and requires the intervention of the system administrator. The following principle is applicable for RMS in this case: Preventing the possible destruction of data is more important than maintaining the availability of the application.

If the ap
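The outcomes described above can be condensed into a small decision function. This is a reading aid under stated assumptions, not RMS logic: the function name `fault_outcome` and the three result strings are hypothetical labels for the cases the text describes.

```python
def fault_outcome(offline_succeeded: bool, halt_flag: bool) -> str:
    """Summarize what follows a fault, per the description in this section.

    Hypothetical labels:
      "switchover"     - offline processing completed, so an online request
                         goes to the userApplication on a remote host
      "eliminate-node" - double fault with the halt flag set
      "blocked"        - double fault without the halt flag: no switchover,
                         non-forced requests are blocked, and the system
                         administrator must intervene
    """
    if offline_succeeded:
        return "switchover"
    if halt_flag:
        return "eliminate-node"
    return "blocked"


if __name__ == "__main__":
    print(fault_outcome(offline_succeeded=False, halt_flag=False))
```

The "blocked" case reflects the stated principle that protecting data takes priority over keeping the application available.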
142. applications. Each child application will remain online on the same node where it was running before the switchover. The IndependentSwitch attribute is ignored when the parent is switched via a forced request such as hvswitch -f. In this case, the controller propagates the switch request to each child application as if the attribute were set to 0.
Must be 0 for a Follow controller. Must be 1 for a Scalable controller.

• LieOffline
Possible Values: 0, 1
Default: 1
Valid for all resource objects. If set to 1, allows the resource to remain Online during Offline processing.

• NoDisplay
Possible Values: 0, 1
Default: 0
Valid for all object types. If set to 1, specifies that the resource should not be displayed when hvdisp is active. Can be overridden by hvdisp -S.

• NullDetector
Possible Values: on, off
Default: off
Valid for resource objects with detector. Used to disable a detector at runtime by setting NullDetector to on. This attribute is for use with dynamic reconfiguration only. NullDetector must never be hard-coded to on in the RMS configuration file.

• OfflineDoneScript
Possible Values: Valid script (character)
Default: (empty)
Valid for userApplication objects. The last script run after the application has completed offline processing.

• OnlineTimeout
Possible Values: 0-MAXINT
Default: 0
Va
143. arate window (Figure 91).

Figure 91: Clusterwide table

You can increase or decrease the size of the clusterwide table window and the size of the columns by using the mouse. If the window is already large enough to fully display all of the table elements, you will not be allowed to further increase its size.

A square surrounding the colored state circle indicates the primary node for the application. Figure 91 shows that fuji2 is the primary node for all of the applications.

Normally, the clusterwide table displays applications in alphabetical order from top to bottom. However, Faulted applications are handled specially. If an application is in the Faulted state on any node in the cluster, it is displayed at the top of the table and the application's name is highlighted by a pink background. This allows the System Administrator to easily spot any Faulted applications.

The clusterwide table also makes special provisions for applications that are not online anywhere in the cluster. These applications are also displayed at the top of the table, and the application's name is highlighted in light blue. Thus, the System Administrator can see which applications are not running anywhere and should probably be brought online on some node.
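The display order of the clusterwide table can be expressed as a sort key. The following is a minimal sketch, not Cluster Admin code: the function name `clusterwide_order` and the input layout are hypothetical, and placing Faulted applications ahead of applications that are not online anywhere is an assumption, because the text only says that both groups appear at the top.

```python
from typing import Dict, List


def clusterwide_order(states: Dict[str, Dict[str, str]]) -> List[str]:
    """Order application names as the clusterwide table description suggests.

    Hypothetical helper: states maps application name -> {node: state}.
    Group 0: Faulted on any node; group 1: not Online on any node;
    group 2: everything else. Alphabetical order applies within each group.
    """
    def rank(item):
        name, per_node = item
        node_states = list(per_node.values())
        if "Faulted" in node_states:
            group = 0
        elif "Online" not in node_states:
            group = 1
        else:
            group = 2
        return (group, name)

    return [name for name, _ in sorted(states.items(), key=rank)]


if __name__ == "__main__":
    example = {
        "app3": {"fuji2": "Online", "fuji3": "Offline"},
        "app1": {"fuji2": "Offline", "fuji3": "Offline"},
        "app2": {"fuji2": "Online", "fuji3": "Faulted"},
    }
    print(clusterwide_order(example))
```

In the example, app2 comes first because it is Faulted on one node, app1 follows because it is not online anywhere, and app3 is listed last.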
144. ason for the problem and reissue hvcm -s.

• Error while starting up local bm: errorreason.
An error occurred while starting up the local bm, for the reason given in errorreason.
Action: Take action based on the reason.

• Failed to dup a file descriptor.
If RMS is unable to dup a file descriptor while setting the environment, this message appears.
Action: Contact field support.

• Failed to exec the hvenv file <hvenvfile>.
RMS was unable to exec the hvenv environment variable file hvenvfile.
Action: Contact field support.

• Failed to open pipe.
If RMS is unable to open a pipe for communication, this message is the result and RMS exits with exit code 1.
Action: Contact field support.

• FATAL ERROR: Could not restart RMS. Restart count exceeded.
When the detector tries to restart RMS, it keeps track of how many times RMS had to be restarted. If this count has exceeded 3, this message is the result.
Action: Contact field support.

• FATAL ERROR: Could not restart RMS. Restart script (script) does not exist.
When the detector is unable to restart RMS because the script script does not exist, this message is the result.
Action: Make sure that the script script exists.

• FATAL ERROR: Could not restart RMS. Failed to recreate RMS restart count file.
When the detector tries to restart RMS, it keeps track of how many times RMS had to be restarted by writi
145. associated component:
ADC: Admin configuration
ADM: Admin, command, and detector queues
BM: Base monitor
CML: Command line
CMM: Communication
CRT: Contracts and contract jobs
DET: Detectors
INI: init script
MIS: Miscellaneous
QUE: Message queues
SCR: Scripts
SYS: SysNode objects
UAP: userApplication objects
US: us files
WLT: Wait list
WRP: Wrappers

9.1 ADC: Admin configuration

• ADC 16: Because some of the global environment variables were not set in hvenv file, RMS cannot start up. Shutting down.
All of the global environment variables RELIANT_LOG_LIFE, RELIANT_SHUT_MIN_WAIT, HV_CHECKSUM_INTERVAL, HV_LOG_ACTION_THRESHOLD, HV_LOG_WARNING_THRESHOLD, HV_WAIT_CONFIG, and HV_RCSTART have to be set in the hvenv file in order for RMS to function properly. If some of them have not been set, RMS exits with exit code 1.
Action: Set the values of all the environment variables in hvenv.

• ADC 21: Because some of the local environment variables were not set in hvenv file, RMS cannot start up. Shutting down.
If some of the local environment variables have not been set in the hvenv file, RMS prints this message and exits with exit code 1.
Action: Make sure that all the local environment variables have been set to an appropriate value in the hvenv file.

• ADC 69: RMS will not start up: previous errors opening file.
The prev
146. at the same time.
When a resource <object> is added to an existing RMS resource graph and it is linked as a child to two parent objects, one of which is online and the other offline or standby, this message is the result: a child object needs to be brought to the state of its parent.
Action: Make sure that both the parents of the resource to be added are in the same state before adding it.

• ADM 38 7: Dynamic modification failed: existing parent resource <parentobject> is in state <state> but needs to be in one of stateOnline, stateStandby, stateOffline, stateFaulted, or stateUnknown.
During dynamic modification, if the state <state> of a parent resource <parentobject> is not one of the states stateOnline, stateOffline, stateFaulted, or stateUnknown, dynamic modification aborts.
Action: Make sure that the state of the parent resource is one of the states mentioned above.

• ADM 39 28: Dynamic modification failed: new resource object, which is a child of application <userApplication>, has its HostName <hostname> the same as another child of application <appname>.
When a new object object is being added as a child of <appname> and the value of its HostName attribute is the same as the value of the HostName attribute of an existing child of <appname>, this message is the result and dynamic modification aborts.
Action: Make sure that the HostName attribute of an
147. ate mutex directory.
The various RMS commands, such as hvdisp, hvswitch, hvutil, and hvdump, use the lock files from the directory <directory> for signal handling purposes. These files are deleted after the commands complete. The locks directory is also cleaned when RMS starts up. If they are not cleaned for some reason, this message is the result. RMS exits with exit code 99.
Action: Make sure that the locks directory <directory> exists.
• GEN 5: <command> failed to get information about RMS base monitor bm.
The generic detector <command> was unable to get any information about the base monitor.
Action: Contact field support.
• GEN 7: <command> failed to lock virtual memory pages, errno <value>, reason: <reason>.
The generic detector <command> was not able to lock its virtual memory pages in physical memory.
Action: Contact field support.
8.11 INI: init script
• INI 1: Cannot open file <dumpfile>, errno <errno>: <explanation>.
This message appears when the file <dumpfile> failed to open because of the error code <errno>, explained in <explanation>.
Action: Correct the problem according to <explanation>.
• INI 9: Cannot close file <dumpfile>, errno <errno>: <explanation>.
This message appears when the file <dumpfile> failed to close because of the error code <errno>, explained in
148. ation Edit Global Settings. The Global settings main menu appears (Figure 35).
Global settings main menu: consistent
 1) HELP                       7) MaxAlternateIps
 2) NO-SAVE+EXIT               8) PreCheckTimeout
 3) SAVE+EXIT                  9) FirstAvailableDetector=0
 4) ShowTurnkeyWizardsOnly    10) LastAvailableDetector=127
 5) AdditionalAlternateIps    11) MaxMenuItemsDisplayed
 6) Additionall_List          12) DetectorDetails
Choose the global setting to process:
Figure 35: Global settings main menu
> Select 5) AdditionalAlternateIps. The Global settings machines menu appears (Figure 36).
Global settings machines menu
 1) HELP
 2) RETURN
 3) MORECHOICES
 4) fuji2RMS
 5) fuji3RMS
Choose a host which needs additional RMS AlternateIps:
Figure 36: Global settings machines menu
Starting with item 4, this menu lists all cluster hosts that are already used by at least one application. The menu does not show hosts that are unused.
> Select 4) fuji2RMS. The Global settings AlternateIps first menu for fuji2RMS appears (Figure 37).
Global settings AlternateIps for fuji2RMS
 1) HELP       4) NONE
 2) NO-SAVE    5) AdditionalAlternateIps
 3) SAVE
Choose the RMS IpAlias to process:
Figure 37: Global settings AlternateIps first menu
> Select 5) AdditionalAlternateIps. The Global settings AlternateIps second menu for fuji2RMS appears (Fi
149. ation on all but one host.
8.9 DET: Detectors
• DET 1: FAULT REASON: Resource <resource> transitioned to a Faulted state due to a child fault.
This message appears when the child faulted unexpectedly, thereby causing the resource to fault.
Action: Check to see why the child resource has faulted and take corrective action accordingly.
• DET 2: FAULT REASON: Resource <resource> transitioned to a Faulted state due to a detector report.
This message is printed when the detector unexpectedly reports the Faulted state.
Action: Check to see why the resource has faulted and take appropriate action.
• DET 3: FAULT REASON: Resource <resource> transitioned to a Faulted state due to a script failure.
This message appears when the detector failed to execute the script for a resource.
Action: Ensure that there is nothing wrong with the script, and also check the resource for any problems.
• DET 4: FAULT REASON: Resource <resource> transitioned to a Faulted state due to a FaultScript failure.
This is a double fault. When a resource faults, it runs its Fault script; in this case, the Fault script also failed to execute for that resource.
Action: Check to see if there is a problem with the resource or with the Fault script.
• DET 5: FAULT REASON: Resource <resource> transitioned to a Faulted state due to the r
150. ations facility: This module is the network transport layer for all PRIMECLUSTER internode communications. It interfaces by means of OS-dependent code to the network I/O subsystem and guarantees delivery of messages queued for transmission to the destination node in the same sequential order, unless the destination node fails.
IP address: See Internet Protocol address.
IP aliasing: This enables several IP addresses (aliases) to be allocated to one physical network interface. With IP aliasing, the user can continue communicating with the same IP address, even though the application is now running on another node. See also Internet Protocol address.
JOIN (CF): See Cluster Join Services (CF).
keyword: A word that has special meaning in a programming language. For example, in the configuration file, the keyword object identifies the kind of definition that follows.
leaf object (RMS): A bottom object in a system graph. In the configuration file, this object definition is at the beginning of the file. A leaf object does not have children.
LEFTCLUSTER (CF): A node state that indicates that the node cannot communicate with other nodes in the cluster. That is, the node has left the cluster. The reason for the intermediate LEFTCLUSTER state is to avoid the network partition problem. See also UP (CF), DOWN (CF), network partition (CF), node state (CF).
link (RMS): Designates a child or parent relationshi
151. be set for children.
Table 10: Object types
13 Appendix: Attributes
Some object types require specific attributes for RMS to monitor that object type. Some attributes can be modified through the user interface, while others are managed internally by PCS or the RMS Wizards.
13.1 Attributes available to the user
Attributes in this section can be changed through the PCS/Wizard Tools user interface or the hvattr command.
• AlternateIp
Possible values: Any interconnect name
Default: empty
Valid for SysNode objects. Space-separated list that RMS uses as additional cluster interconnects if the interconnect assigned to the SysNode name becomes unavailable. All these interconnects must be found in the /etc/hosts database. By default, the configuration wizards assume the alternate interconnects to node <nodename> have names of the form <nodename>rmsAI<nn>, where <nn> is a two-digit, zero-filled number. This setting is restricted to very specific configurations and must never be used in a cluster with CF as interconnect.
• ApplicationSequence
Possible values: Valid string (character) of the format group1 group2, where each group is a space-delimited list of userApplication object names
Default: empty
Valid for Scalable controller objects. Specifies the list of all child applications
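The default naming rule for alternate interconnects (names of the form <nodename>rmsAI<nn>, with <nn> a two-digit, zero-filled number) can be illustrated with a short sketch. The function names are hypothetical, and starting the numbering at 01 is an assumption that the text does not confirm.

```python
def alternate_ip_names(nodename, count, first=1):
    """Build names of the form <nodename>rmsAI<nn> for one node.

    `first` is the first suffix used; 1 is an assumption, not a
    documented default.
    """
    last = first + count - 1
    if count < 0 or first < 0 or last > 99:
        raise ValueError("suffix must fit in two digits")
    return [f"{nodename}rmsAI{n:02d}" for n in range(first, last + 1)]


def alternate_ip_attribute(names):
    """Join names into the space-separated form the attribute expects."""
    return " ".join(names)


# Hypothetical node name, for illustration only.
names = alternate_ip_names("fuji2", 2)
print(alternate_ip_attribute(names))
```

Each generated name would still need a matching entry in /etc/hosts, as the attribute description requires.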
152. blems with PCS
7.10.1 Manual Script Execution
The expert user will occasionally want to execute an object's scripts (online, offline, etc.) one at a time for diagnostic purposes. This functionality is referred to as Manual Script Execution (MSE). MSE is called from different places in the PCS GUI and PCS CUI:
• In the PCS GUI, this functionality is available from the Draw Graph window.
• In the PCS CUI, this functionality is available in the Advanced Menu under the Manual Script Execution Tree.
The configuration must be generated for MSE to be available. To invoke MSE, right-click the desired RMS resource in the PCS GUI graph, or select the MSE menu item in the PCS CUI, and then select the desired script to execute. The output of the scripts is displayed in the log file /var/opt/SMAW/log/pcs.log.
7.11 RMS troubleshooting
When problems occur, RMS prints error messages that will assist you in troubleshooting the cause. If no message is available, the following information may help you diagnose and correct some unusual problems:
• RMS dies immediately after being started.
At startup, the RMS base monitor exchanges its configuration checksum with the other base monitors on remote nodes. If the checksum of the starting base monitor matches the checksums from the remote nodes, the startup process continues. If the checksums do not match, then the
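The startup checksum comparison can be pictured with a small sketch. It only identifies which remote nodes disagree with the local checksum; what RMS does on a mismatch is cut off in the text, so the sketch does not model it. SHA-256 is an assumed stand-in, since the checksum algorithm RMS uses is not stated, and the function names and sample data are hypothetical.

```python
import hashlib


def config_checksum(config_bytes):
    """Checksum of a configuration image (SHA-256 is an assumed stand-in)."""
    return hashlib.sha256(config_bytes).hexdigest()


def mismatched_nodes(local_checksum, remote_checksums):
    """Return, sorted, the remote nodes whose checksum differs from ours."""
    return sorted(
        node
        for node, checksum in remote_checksums.items()
        if checksum != local_checksum
    )


# Hypothetical configuration contents, for illustration only.
local = config_checksum(b"config-v2")
remotes = {
    "fuji2RMS": config_checksum(b"config-v2"),
    "fuji3RMS": config_checksum(b"config-v1"),
}
print(mismatched_nodes(local, remotes))
```

When diagnosing a base monitor that dies right after startup, a comparison like this points to the nodes whose configuration should be inspected first.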
153. c modification is aborted.
Action: Make sure that the child object that has been specified exists.
• ADM 32 21: Dynamic modification failed: child object <childobject> is not a resource.
When adding a new object to the RMS resource graph, if the child <childobject> of this new object is not a resource, dynamic modification aborts.
Action: Make sure that when adding a new object, its child is a resource.
• ADM 33 5: Dynamic modification failed: object <object> cannot be deleted since either it is absent or it is not a resource.
If RMS gets a directive to delete an object <object> that is either non-existent or not a resource, this message is the result and dynamic modification fails.
Action: Make sure that you do not try to delete an object that does not exist.
• ADM 34 22: Dynamic modification failed: deleted object <object> is neither a resource nor an application nor a host.
An object deleted during dynamic modification is neither a resource-type object, nor a userApplication, nor a SysNode object. Only resources, applications, and hosts (SysNode objects) can be deleted during dynamic modification.
Action: Do not delete this object, or delete another object.
• ADM 37 6: Dynamic modification failed: resource <object> cannot be brought online and offline/standby
154. cation 4
Figure 25: List of nodes for failover procedure
The Wizards retrieve the default list of nodes from the CIP configuration file. Since our application is presently configured for fuji2RMS, fuji3RMS should become the additional node.
> Select fuji3RMS by entering the number 4.
In the menu that follows (Figure 26), you will see your selection confirmed. fuji3RMS now appears under Machines[1] as the additional node. If there is a failure on fuji2RMS, your application is configured to switch over to fuji3RMS.
Consistency check ...
Machines+Basics (app1:consistent)
 1) HELP                    14) FaultScript
 2) -                       15) AutoStartUp=no
 3) SAVE+EXIT               16) AutoSwitchOver=No
 4) REMOVE+EXIT             17) PreserveState=no
 5) AdditionalMachine       18) PersistentFault=0
 6) AdditionalConsole       19) ShutdownPriority
 7) Machines[0]=fuji2RMS    20) OnlinePriority
 8) Machines[1]=fuji3RMS    21) StandbyTransitions
 9) PreCheckScript          22) LicenseToKill=no
10) PreOnlineScript         23) AutoBreak=yes
11) PostOnlineScript        24) HaltFlag=no
12) PreOfflineScript        25) PartialCluster=0
13) OfflineDoneScript       26) ScriptTimeout
Choose the setting to process: 16
Figure 26: Machines+Basics menu for additional nodes
At this point, the default value of No is specified for 16) AutoSwitchOver. This means that to actually switch your application ove
155. cation becomes Offline, the controlling application tries to restart it. The AUTORECOVER menu item is now in the opposite state, that is, ready to be toggled to NOT. The T (TIMEOUT) flag limits the amount of time tolerated while bringing the controlled application Online. In this example, we will reduce the timeout period to 150 seconds.
> Change the timeout period by entering 7.
> In the menu that appears (Figure 56), select FREECHOICE by entering the number 3.
 1) HELP
 2) RETURN
 3) FREECHOICE
 4) 180
Set an appropriate timeout: 3
>> 150
Figure 56: Changing controller timeout period
> At the >> prompt, enter 150 for the timeout period.
> Press Enter or Return to return to the menu for controller flags (Figure 57).
Set flags for (sub) application: app1
Currently set: AUTORECOVER, TIMEOUT (AT150)
 1) HELP
 2) -
 3) SAVE+RETURN
 4) DEFAULT
 5) MONITORONLY(M)
 6) NOT:AUTORECOVER(A)
 7) TIMEOUT(T)
Choose one of the flags: 3
Figure 57: Saving flags for controller
After completing the settings, save them and return to the Controllers menu as follows:
> Select SAVE+RETURN by entering the number 3.
The Controllers menu shows that the controller settings are now consistent (Figure 58).
Consistency check ...
Settings of application type Controller (consistent)
 1) HELP ajj
156. ccording to the description, or contact field support.
• BM 69: Some of the OS message queue parameters msgmax=<msgmax>, msgmnb=<msgmnb>, msgmni=<msgmni>, msgtql=<msgtql> are below lower bounds <hvmsgmax>, <hvmsgmnb>, <hvmsgmni>, <hvmsgtql>. RMS is shutting down.
One or more of the system-defined message queue parameters is not sufficient for correct operation of RMS. RMS shuts down with exit code 28.
Action: Change the OS message queue parameters and reboot the OS before restarting RMS.
• BM 82: A message to host <remotehost> failed to reach that host after <count> delivery attempts. Communication with that host has been broken. Therefore, RMS monitor on this host <localhost> is going down.
A communication breakdown prevented delivery of a message between the local and remote RMS monitors. In this case, the local monitor exits.
Action: Make sure <remotehost> is up and that communication between the two hosts is possible. Use standard tools such as ping, and make sure that the local root account can rlogin or rsh to the remote host. After communication has been re-established, restart the local RMS monitor.
• BM 89: The SysNode length is <length>. This is greater than the maximum allowable length of <maxlength>. RMS will now shut down.
The SysNode name length is greater than the maximum allowable le
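The BM 69 condition amounts to comparing each message queue parameter with its lower bound. The following is a minimal sketch of that comparison; the bound values below are hypothetical placeholders, because the real lower bounds (<hvmsgmax> and so on) are not given in the text.

```python
def below_lower_bounds(current, lower_bounds):
    """Map each insufficient parameter to (current value, required bound).

    A parameter that is absent from `current` is reported with None.
    """
    shortfalls = {}
    for name, bound in lower_bounds.items():
        value = current.get(name)
        if value is None or value < bound:
            shortfalls[name] = (value, bound)
    return shortfalls


# Hypothetical numbers, for illustration only.
LOWER_BOUNDS = {"msgmax": 16384, "msgmnb": 4194304, "msgmni": 8192, "msgtql": 65535}
CURRENT = {"msgmax": 8192, "msgmnb": 4194304, "msgmni": 16000}

for name, (value, bound) in sorted(below_lower_bounds(CURRENT, LOWER_BOUNDS).items()):
    print(f"{name}: current={value} required>={bound}")
```

Any parameter reported here would have to be raised, followed by an OS reboot, before RMS is restarted, as the Action text states.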
157. ceeds without errors.
• ADC 49: Error checking hvdisp temporary file <filename>, errno <errno>, hvdisp process pid <processid> is restarted.
The RMS base monitor periodically checks the integrity and size of the temporary file used to transfer configuration data to the hvdisp process. If this file cannot be checked, the hvdisp process is restarted automatically, though some data may be lost and not displayed at this time. The specific OS error code for the error encountered is displayed in <errno>.
Action: Make sure that the host conditions are such that the temporary file can be checked. Sometimes you may need to restart the hvdisp process by hand.
• ADC 57: An error occurred while writing out the RMS configuration for the joining host. The hvjoin operation is aborted.
When a remote host joins a cluster, this host attempts to dump its own configuration for a subsequent transfer to the remote host. If the configuration cannot be saved, the hvjoin operation is aborted.
Action: One of the previous messages contains a detailed explanation of the error that occurred while saving the configuration. Correct the host environment according to the explanation, or contact field support.
• ADC 58: Failed to prepare configuration files for transfer to a joining host. Command used <command>.
When a remote host joins a cluster, this h
158. child <childobject> since they are already linked.
Trying to link a parent <parentobject> and a child <childobject> that are already linked results in this message. Dynamic modification is aborted.
Action: While performing dynamic modification, make sure that the parent and the child that are to be linked are not already linked.
• ADM 18 49: Dynamic modification failed: cannot link a faulted child <childobject> to parent <parentobject> which is not faulted.
While creating a new link between two existing objects during dynamic modification, a faulted child <childobject> cannot be linked to a parent <parentobject> that is not faulted. The child first needs to be brought to the state of the parent. If this condition is violated, this message is printed to the switchlog and dynamic modification is aborted.
Action: Bring the faulted child to the state of the parent before linking them.
• ADM 19 50: Dynamic modification failed: cannot link child <childobject> which is not online to online parent <parentobject>.
While linking two existing objects during dynamic modification, the combination of states parent Online and child not Online is not allowed. When this happens, dynamic modification is aborted and a message is printed to the switchlog.
Action: The ch
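The two link rules described for ADM 18 and ADM 19 can be sketched as a small state check. This models only what the two messages state; the state names as plain strings, the function name, and the order in which the rules are tested are assumptions, and the real base monitor may apply further checks.

```python
def link_error(parent_state, child_state):
    """Return the message ID that would reject the link, or None if allowed.

    Only the two rules quoted for ADM 18 and ADM 19 are modeled; testing
    ADM 18 first is an assumption.
    """
    if child_state == "Faulted" and parent_state != "Faulted":
        return "ADM 18"  # faulted child, parent not faulted
    if parent_state == "Online" and child_state != "Online":
        return "ADM 19"  # online parent, child not online
    return None


# Illustrative combinations of (parent, child) states.
for parent, child in [
    ("Online", "Online"),
    ("Online", "Offline"),
    ("Offline", "Faulted"),
    ("Faulted", "Faulted"),
]:
    print(parent, child, link_error(parent, child))
```

In both rejected cases the remedy in the text is the same: bring the child to the state of the parent before linking.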
159. code 7.
Action: Take action based on the errormsg.
• The user has invoked the hvcm command with the -a flag on a host where RMS is already running, sending request to start all remaining hosts.
If hvcm is invoked with the -a flag, RMS will be started on the other hosts in the cluster.
Action: None required.
• timed out. Most likely rms on the remote host is dead.
While performing hvrcp, if the command times out because the base monitor on the local host has not received an acknowledgement from the base monitor on the remote host, the most probable reason is that RMS on the remote host is dead.
Action: Make sure that RMS on the remote host is running.
• Too many arguments. usage: hvmod -E
The hvmod utility does not expect any arguments when invoked with the -E option. If any are supplied, hvmod exits with exit code 1.
Action: Make sure that hvmod -E is not invoked with any arguments.
• Too many asserted objects, maximum is the max.
Any attempt to assert on a number of objects greater than the maximum causes this message to be printed.
Action: Make sure that the number of asserted objects does not exceed the maximum.
• Usage: hvassert [-h SysNode] [-q] -s resource_name resource_state | [-h SysNode] [-q] -w resource_name resource_state seconds
If the utility hvassert has been invoked in a way that does not conform
160. ct that is going to be deleted.
When there is an attempt to delete a child object after its parent object has been deleted, the above message appears in the switchlog and dynamic modification is aborted.
Action: Make sure that when an object is being deleted explicitly, its parents have not already been deleted, because that means this object has also been deleted.
• ADM 12 38: Dynamic modification failed: cannot delete <resource> since its children will be deleted.
When there is an attempt to delete a resource <resource> whose children have already been deleted, the above message appears in the switchlog and dynamic modification is aborted.
Action: Make sure that when a resource is being deleted explicitly, its children have not already been deleted.
• ADM 13 52: dynamic modification failed: object <resource> is in state <state> while needs to be in one of stateOnline, stateStandby, stateOffline, stateFaulted, or stateUnknown.
Every resource has to be in one of the states stateOnline, stateOffline, stateFaulted, stateUnknown, or stateStandby. If the resource <resource> is not in any of these states, RMS prints the above message and dynamic modification is aborted. Theoretically, this is not possible.
Action: Contact field support.
• ADM 14 48: Dynamic modification f
161. ction: Make sure that hvmod is not operating on appname before an hvswitch is performed.
• ADM 60: <resource> is not a userApplication object, switch request skipped.
While performing a switch, hvswitch requires a userApplication as its argument. If the resource <resource> is not a userApplication, this message is the result.
Action: Check the man page for hvswitch for usage information.
• ADM 62: The attribute <ShutdownScript> may not be specified for object <object>.
The attribute ShutdownScript is a hidden attribute of a SysNode. The RMS base monitor automatically defines its value; users cannot change it in any way.
Action: Do not attempt to change the built-in value of the ShutdownScript attribute.
• ADM 63: System name <sysnode> is unknown.
This message can occur in these scenarios:
- The name of the SysNode specified in hvswitch is not included in the current configuration (hvswitch [-f] appname sysnode).
- The name of the SysNode specified for hvshut -s sysnode is not a valid one, i.e., sysnode is not included in the current configuration.
- The name of the SysNode specified for hvutil -[ou] is unknown (hidden options).
Action: Specify a SysNode that is included in the current configuration, i.e., one that appears in the <configname>.us file.
• ADM 67: sysnode Cann
162. ctions for configuring and administering the PRIMECLUSTER Cluster Foundation.
• Cluster Foundation (CF) Configuration and Administration Guide (Linux): Provides instructions for configuring and administering the PRIMECLUSTER Cluster Foundation.
• Reliant Monitor Services (RMS) (Solaris, Linux) Troubleshooting Guide: Describes diagnostic procedures to solve RMS configuration problems, including how to view and interpret RMS log files. Provides a list of all RMS error messages with a probable cause and suggested action for each condition.
• Scalable Internet Services (SIS) (Solaris, Linux) Configuration and Administration Guide: Provides information on configuring and administering Scalable Internet Services (SIS).
• Global Disk Services (Solaris) Configuration and Administration Guide: Provides information on configuring and administering Global Disk Services (GDS).
• Global File Services (Solaris) Configuration and Administration Guide: Provides information on configuring and administering Global File Services (GFS).
• Global Link Services (Solaris) Configuration and Administration Guide: Redundant Line Control Function: Provides information on configuring and administering the redundant line control function for Global Link Services (GLS).
• Global Link Services (Solaris) Configuration and Administration Guide: Multipath Function: Provides information on configurin
163. cy check, the wizard informs you what to do next (see Figure 23).
Consistency check ...
Yet to do: process the basic settings using Machines+Basics
Yet to do: choose a proper application name
Settings of turnkey wizard DEMO
 1) HELP
 2) NO-SAVE+EXIT
 3) SAVE+EXIT
 4) REMOVE+EXIT
 5) ApplicationName=APP1
 6) BeingControlled=no
 7) Machines+Basics
Choose the setting to process: 7
Figure 23: Prompting for further actions
At each step, the wizard checks the consistency of the application being configured. Only consistent applications are allowed to be part of the high availability configuration. If you want to specify a different application name, you could do it here by selecting 5) ApplicationName. However, because we are using the default of APP1, the Yet to do message will disappear after you select 7) Machines+Basics.
4.5 Entering Machines+Basics settings
> Select Machines+Basics by entering the number 7. The Machines+Basics menu appears (Figure 24).
Consistency check ...
Machines+Basics (app1:consistent)
 1) HELP                 14) AutoStartUp=no
 2) -                    15) AutoSwitchOver=No
 3) SAVE+EXIT            16) PreserveState=no
 4) REMOVE+EXIT          17) PersistentFault=0
 5) AdditionalMachine    18) ShutdownPriority
 6) AdditionalConsole    19) OnlinePriority
 7) Machin
164. d
• Command timed out.
Action: none specified
• Could not open localfile or could not create temporary file filename.
If, during hvrcp, the localfile cannot be opened for reading or the temporary file filename cannot be opened for writing, this message is printed and hvrcp exits with exit code 7.
Action: Check the permissions on the localfile to make sure that it is readable.
• Could not restart RMS. RELIANT_PATH not set.
When the detector restarts RMS, it checks the value of the environment variable RELIANT_PATH. If it cannot get the value of this variable, this message is printed.
Action: Make sure that RELIANT_PATH is set to an appropriate value.
• Delay delay seconds.
This is an informational message specifying the delay, in seconds, that hvsend has been given.
Action: None required.
• DISCLAIMER: The hvdump utility will collect the scripts, configuration files, log files, and any core dumps. These will be shipped out to RMS support. If there are any proprietary files you do not want included, please exit now. Do you want to proceed? (yes = continue / no = quit)
This message is printed on executing hvdump -E. The necessary information is collected only if the answer to the above question is yes.
Action: Respond to the prompt.
• DISCLAIMER: The hvdump utility will now collect the necessary info
165. [Figure 71: Configuration information or object attributes]
5.2.4.3 Command pop-ups
You can perform many operations on the RMS tree objects by using the context-sensitive command pop-up menus. Invoke the pop-up menu by right-clicking the object. The menu options are based on the type and the current state of the selected object (Figure 72).
[Figure 72: Command pop-up]
For example, the menu offers different options
166. dded, removed, or renamed.
• Two or more clusters were merged into one.
• File systems or SANs were added or removed.
For convenience, the site preparation descriptions for hosts, file systems, and networks are duplicated here. If any of these specifications have changed since your initial RMS installation, you should review this material and make the necessary adjustments before proceeding with your RMS configuration. The modifications generally involve adding RMS-specific entries to standard system files; pre-existing entries required for proper operation of your hosts and network are not affected. Resources for market-specific applications may require similar customization. See the section "Further reading" on page 55 for more details.
3.2.1 Network
• /etc/hosts
Must contain the IP addresses and RMS names of all the host systems that are part of the cluster. RMS uses its own internal set of host names to manage the machines in the cluster. When you configure the cluster, you will use the RMS host names and not the standard host names. These names must be entered in /etc/hosts on each system in the cluster to avoid problems should access to the DNS fail. If you used Cluster Admin to configure CIP for RMS, then /etc/hosts will already contain the correct RMS node names described below. By default, the names follow the conventions in Table 4
167. details.
4. If you cannot resolve an error condition with the GUI, you can use the command line interface. Use standard UNIX commands.
5. If a problem persists, check whether it is a non-RMS issue and refer to the appropriate manual.
6. Check for system-related issues such as operating system, hardware, or network errors.
7. Contact field support if you cannot resolve the issue.
7.2 Debug and error messages
RMS writes debug and error messages to log files when its components, such as the base monitor or detectors, operate. By default, RMS stores these files in the /var/opt/SMAWRrms/log directory. Users can change the directory with the RELIANT_LOG_PATH environment variable, which is set in the hvenv.local file. Logging begins when RMS starts. By default, the base monitor writes all error messages to its log file or to stderr. Normally, you do not need to change the default setting, because the default options allow for very detailed control of debug output. If required, you can use the base monitor to record every state and message of any node. However, in most cases, interpreting the debug output requires a detailed knowledge of internal RMS operation, so it can only be evaluated by service personnel. For the administrator of an RMS cluster, evaluating the switchlog file is normally sufficient. This file records al
168. [Figure 110: Clusterwide environment variables. The Global Environment (cluster-wide) panel lists the attributes RELIANT_PATH, RELIANT_LOG_PATH, RELIANT_LOG_LIFE, RELIANT_SHUT_MIN_WAIT, HV_AUTOSTART_WAIT, HV_CHECKSUM_INTERVAL, HV_LOG_ACTION_THRESHOLD, HV_LOG_WARN_THRESHOLD, HV_WAIT_CONFIG, and HV_RCSTART with their values.]
Display local environment variables as follows:
> Right-click on a node in the RMS tree window and select View Environment in the command pop-up (Figure 111).
169. dministering RMS is the Cluster Admin GUI. Both the RMS Wizards and Cluster Admin call the RMS CLI, and under certain conditions you may find the CLI useful. For example, to manually switch a user application to another node in the cluster, use the following CLI command:
> hvswitch userApplication SysNode
In this case, userApplication is the user application that the user wants to switch to the system node SysNode. Table 1 lists the RMS CLI commands available to administrators. Refer to the chapter "Appendix: List of manual pages" on page 349 for additional information on RMS CLI commands.
Command    Function
hvassert   Tests an RMS resource for a specified resource state. It can be used in scripts when a resource must achieve a specified state before the script can issue the next command. Does not require root privilege.
hvattr     Provides an RMS Wizard interface for changing the AutoSwitchOver attribute at runtime. The change can be made from a single node in the cluster and will be applied clusterwide for one or more userApplication objects in the currently running configuration. The values No, HostFailure, ResourceFailure, or ShutDown may be specified. hvattr command arguments are specific to RMS configuration files. The user should be familiar with the RMS Wizards.
hvcm       Starts the base monitor and the detectors for all monitored resources
170. e command has failed due to the process exiting by using an exit call, this message is printed to the switchlog along with the reason for the failure.
Action: Check the switchlog to find the reason for this failure and rectify it before reissuing the hvmod command.
• ADC 43: The file transfer for <filename> failed in command. The dynamic modification will be aborted.
During dynamic modification, files containing modification information are transferred between the hosts of the cluster. If for any reason a file transfer fails, the dynamic modification is aborted.
Action: Make sure that host and cluster conditions are such that command can be safely executed.
• ADC 44: The file transfer for <filename> failed in command. The join will be aborted.
When a host joins a cluster, it receives a cluster configuration file. If for any reason the file transfer fails, the join is aborted.
Action: Make sure that host and cluster conditions are such that command can be safely executed.
• ADC 45: The file transfer for <filename> failed in command with errno <errno>: errorreason. The dynamic modification will be aborted.
During dynamic modification, files containing modification information are transferred between the hosts of the cluster. If for any reason a file transfer fails th
171. e and time

[Figure 79: Search based on date and time filter. The log viewer shows the switchlog on fuji2RMS with the time filter enabled, together with the keyword and severity filters; the displayed entries are NOTICE messages for a normal switch request and for the PreCheck and Online processing of app1 and app2.]

You can also search the text in the application log by right-clicking on the displayed text. This brings up a small command pop-up with a Find option (Figure 80).
172. e default values of the environment variables are found in <RELIANT_PATH>/bin/hvenv. They can be redefined in the hvenv.local command file. The following list describes the global environment variables for RMS:

• HV_AUTOSTARTUP_IGNORE
  Possible values: List of RMS cluster nodes. The list of RMS cluster nodes must be the names of the SysNodes as found in the RMS configuration file. The list of nodes cannot include the CF name.
  Default: empty
  List of cluster nodes that RMS ignores when it starts. This environment variable is not set by default. A user application will begin its automatic startup processing if the AutoStartUp attribute is set and when all cluster nodes defined in the user application have reported Online. If a cluster node appears in this list, automatic startup processing will begin even if this node has not yet reported the Online state.
  Use this environment variable if one or more cluster nodes need to be taken out of the cluster for an extended period and RMS will continue to use the configuration file that specifies the removed cluster nodes. In this case, specifying the unavailable cluster nodes in this environment variable ensures that all user applications are automatically brought online even if the unavailable cluster nodes do not report Online.
  Caution: If this environment variable is used, ensure that it
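The startup rule described for HV_AUTOSTARTUP_IGNORE can be illustrated with a small sketch. This is not RMS code: the function name and its parameters are hypothetical, and it only models the rule as stated above (all nodes defined for the application must have reported Online, except those in the ignore list).

```python
def can_begin_autostartup(app_nodes, online_nodes, ignored_nodes=(), auto_startup=True):
    """Model of the rule in the text (hypothetical helper, not an RMS API).

    app_nodes:     SysNode names defined for the user application
    online_nodes:  SysNode names that have reported Online so far
    ignored_nodes: names listed in HV_AUTOSTARTUP_IGNORE
    auto_startup:  whether the AutoStartUp attribute is set
    """
    if not auto_startup:
        return False
    # Nodes in the ignore list are not waited for.
    required = set(app_nodes) - set(ignored_nodes)
    return required <= set(online_nodes)


nodes = ["fuji2RMS", "fuji3RMS"]
print(can_begin_autostartup(nodes, ["fuji2RMS"]))
print(can_begin_autostartup(nodes, ["fuji2RMS"], ignored_nodes=["fuji3RMS"]))
```

With fuji3RMS missing, the first call reports that startup must wait, while the second call proceeds because the missing node is listed as ignored.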
173. e dynamic modification is aborted. A specific reason for this failure is referred to by the OS error code ERRNO and its explanation in ERRORREASON.
  Action: Make sure that host and cluster conditions are such that command can be safely executed.

• ADC 46: The file transfer for <filename> failed with unequal write byte count, expected expectedvalue actual actualvalue. The dynamic modification will be aborted.
  During dynamic modification, files containing modification information are transferred between the hosts of the cluster. During the transfer, RMS keeps track of the integrity of the transferred data by counting the bytes transferred. This count can be incorrect if the transfer process is broken or interrupted.
  Action: Make sure that host, cluster, and network conditions are such that command can be safely executed.

• ADC 47: RCP fail: can't open file filename.
  If the file <filename> that has been specified as the file to be copied from the local host to the remote host cannot be opened for reading, this message is the result.
  Action: Make sure that the file <filename> is readable.

• ADC 48: RCP fail: fseek errno errno.
  During a file transfer between the hosts, RMS encountered a problem indicated by the OS error code ERRNO.
  Action: Make sure that the host, cluster, and network conditions are such that file transfer pro
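The byte-count integrity check behind ADC 46 can be sketched as follows. This is a minimal illustration under stated assumptions, not the RMS transfer code: the function name, the chunk size, and the use of in-memory streams are all hypothetical.

```python
import io


def copy_with_byte_count(src, dst, expected, chunk_size=4096):
    """Copy src to dst and verify the number of bytes written.

    Hypothetical helper: raises IOError when the written byte count
    differs from the expected count, mirroring the ADC 46 condition.
    """
    written = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        # write() on a binary stream returns the number of bytes written
        written += dst.write(chunk)
    if written != expected:
        raise IOError(
            f"unequal write byte count: expected {expected}, actual {written}"
        )
    return written


payload = b"x" * 10000
target = io.BytesIO()
print(copy_with_byte_count(io.BytesIO(payload), target, expected=len(payload)))
```

A transfer that is cut short produces a count lower than expected, which is the situation the message reports.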
174. e> to other node <othersysnode>, but one of the following conditions was encountered:
  - <othersysnode> is not a valid name
  - <othersysnode> is already used by some other host in the cluster
  - <othersysnode> is not a resource
  - <othersysnode> is a controlled application
  Action: Choose another valid host name.

• ADM 90 70: Dynamic modification failed: cannot change attribute Resource of the controller object <controllernode> from <oldresource> to <newresource> because some of <oldresource> are going to be deleted.
  This message appears when the user tries to rename a resource that is controlled by a controller object and is going to be deleted.
  Action: Make sure deleted applications are not referred to from any controller.

• ADM 91 71: Dynamic modification failed: controller <controller> has its Resource attribute set to <resource>, but application named <appname> is going to be deleted.
  This message appears when the user tries to control a resource <resource> with a controller <controller>, but the application associated with that resource is going to be deleted.
  Action: Make sure the controller's Resource attribute does not refer to a deleted application.

• ADM 95: Cannot retrieve information about command line used when starting RMS. Start on
175. e local host has received a message from host host, but the local host is unable to resolve the sending host's address. This could be due to a misconfiguration. This message will be dropped. Further such messages will appear in the switchlog.
  RMS on the local host has received a message from host host whose address is not resolvable by the local host.
  Action: Make sure that the local host is able to resolve the remote host host's address by checking for any misconfigurations.

• WRP 30: RMS on the local host has received a message from host host, but the local host is unable to resolve the sending host's address. This message will be dropped. Please check for any misconfiguration.
  RMS on the local host has received a message from host host whose address is not resolvable by the local host.
  Action: Make sure that the local host is able to resolve the remote host host's address by checking for any misconfigurations.

• WRP 31: RMS has received a message from host host with IP address receivedip. The local host has calculated the IP address of that host to be calcip. This may be due to a misconfiguration in /etc/hosts. Further such messages will appear in the switchlog.
  The local host has received a message from host host with IP address receivedip, which is different from the locally calculated IP address for that host.
  Action: Check /etc/hosts for any mi
176. e name and configuration scripts. Users can specify attributes in any order in the object definition.

Refer to the chapter "Appendix - Attributes" on page 325 for the supported types, their associated values, and a description of each attribute.

This information is provided for reference material. The values are determined by the RMS Wizards during the Configuration Generate phase of the configuration process. Refer to the chapter "Using the Wizard Tools interface" on page 31.

2.9 Environment variables

RMS uses global and local environment variables:

• Global variables must have the same setting on all nodes in the cluster. RMS maintains global environment variables in the ENV object and in the /opt/SMAW/SMAWRrms/bin/hvenv configuration file.
• Local variables override global variables and can differ from node to node. RMS maintains local environment variables in the ENVL object and in the /opt/SMAW/SMAWRrms/bin/hvenv.local configuration file.

If the RELIANT_PATH global variable has been redefined, global and local variables are located in the RELIANT_PATH/bin/hvenv and RELIANT_PATH/bin/hvenv.local files, respectively.

RMS creates the ENV or ENVL objects dynamically using the contents of the hvenv and hvenv.local files when the base monitor starts up. Values in the ENVL object override values in the ENV object. See the section Setting environment
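The override relationship between the two objects can be pictured with a short sketch. It is only a model of the precedence rule stated above (ENVL values win over ENV values); the function name and the sample variable values are illustrative assumptions, not part of RMS.

```python
from collections import ChainMap


def effective_environment(global_env, local_env):
    """Return the merged view in which local (ENVL) values override
    global (ENV) values. Hypothetical helper for illustration only."""
    # ChainMap looks keys up in the first mapping before the second.
    return dict(ChainMap(local_env, global_env))


env = {"SCRIPTS_TIME_OUT": "200", "RELIANT_PATH": "/opt/SMAW/SMAWRrms"}
envl = {"SCRIPTS_TIME_OUT": "300"}  # node-specific value, assumed for the example
merged = effective_environment(env, envl)
print(merged["SCRIPTS_TIME_OUT"], merged["RELIANT_PATH"])
```

Variables that appear only in the global set pass through unchanged, while a variable defined in both takes the local value on that node.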
177. e resource(s) supposed to come online failed.
  During dynamic modification, when new resource(s) that are to be added to a parent object that is online cannot be brought online by executing the online scripts, dynamic modification is aborted.
  Action: Make sure the new resource(s) can be brought to the online state and reissue the hvmod command.

• ADM 5 17: Dynamic modification failed: object <object> is not linked to any application.
  During dynamic modification, if there is an attempt to add an object <object> that does not have a parent and hence is not linked to any userApplication, this message is printed and dynamic modification is aborted.
  Action: Make sure that every object being added during dynamic modification is linked to a userApplication.

• ADM 6 36: Dynamic modification failed: cannot add new resource <resource> since another existing resource with this name will remain in the configuration.
  When RMS receives a directive to add a new resource <resource> with the same name as that of an existing resource, this message is printed out to the switchlog and dynamic modification aborts.
  Action: Make sure that when adding a new resource, its name does not match the name of any other existing resource.

• ADM 7 35: Dynamic modification failed: cannot add new resource <resource> since another existing resource with this name will not be deleted.
  When RMS receives a directive to add
178. e to be specified in the hvenv and hvenv.local files. A /tmp directory that is nearly full may result in RMS errors because hvenv uses this directory to sort RMS environment variables.

You can change the hvenv.local file on a node in the cluster, but the hvenv file must not be changed on any node. To activate your changes, you must stop RMS and restart it.

Caution: RMS environment variables cannot be set in the user environment explicitly. Doing so can cause RMS to lose environment variables settings.

The values of environment variables are specified as export directives in these files. An example of an export directive would be as follows:

export SCRIPTS_TIME_OUT=200

You should change environment variables before running the configuration file. While RMS is running, you can display the environment variables with the hvdisp command, which does not require root privilege:

• hvdisp ENV
• hvdisp ENVL

2.10 Directory structure

RMS software consists of a number of executables, scripts, files, and commands, all located relative to the directory specified in the RELIANT_PATH environment variable. Table 2 illustrates the directory structure of the RMS software after it has been correctly installed.

Name                  Contents
RELIANT_PATH          Base directory (default: /opt/SMAW/SMAWRrms)
<RELIANT_PATH>/bin    Executables, including detectors, commands, and scripts
<
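As a rough illustration of the export-directive format, the sketch below reads such directives into a dictionary. It assumes the shell form export NAME=value and that blank lines and lines starting with # can be skipped; neither the function nor these parsing rules come from the RMS documentation, and the second sample variable value is made up for the example.

```python
def parse_export_directives(text):
    """Collect NAME=value pairs from lines of the form 'export NAME=value'.

    Hypothetical helper: comment handling and quote stripping are
    assumptions made for this sketch, not documented hvenv behavior.
    """
    env = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if not line.startswith("export "):
            continue
        name, sep, value = line[len("export "):].partition("=")
        if sep:
            env[name.strip()] = value.strip().strip('"')
    return env


sample = """
# local overrides (illustrative content)
export SCRIPTS_TIME_OUT=200
export HV_AUTOSTARTUP_IGNORE="fuji3RMS"
"""
print(parse_export_directives(sample))
```

Such a reader is only useful for inspection; as the text notes, changes take effect only after RMS is stopped and restarted.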
179. e user tries to start the RMS with a configuration different from the configuration present in the RMS default configuration file. The base monitor is not started; the user will need to either change the default configuration file by re-activating the configuration via the Wizard Tools (hvw command) or specify the proper option argument for the -c option.
  Action: The user should correct the default configuration by activating the specified configuration file using the Wizard Tools, or specify the proper option argument to the -c option.

• The file filename could not be opened: errormsg
  While performing an hvdump, if the file filename could not be opened because of errormsg, this message is the result and hvdump exits with exit code 8.
  Action: Take action based on the errormsg.

• The length of return message from BM is illegal actuallength actual expectedlength expected
  When the hvassert utility, expecting a return message from the base monitor, receives a message of length actuallength when it is expecting a message of length expectedlength, this message is printed and hvassert exits with exit code 5.
  Action: Contact field support.

• The system call systemcall could not be executed: errormsg
  While performing an hvdump, if the systemcall could not be executed because of errormsg, this message is the result and hvdump exits with exit
180. eaf object and this type <type> does not have a detector. Leaf objects must have detectors.
  An object that has no children objects, i.e. a leaf object, is of type type that has no detectors in RMS. All leaf objects in RMS configurations must have detectors.
  Action: Redesign your configuration so that all leaf objects have detectors.

• BAS 16: ERROR IN CONFIGURATION FILE: The object object has an empty DeviceName attribute. This object uses a detector and therefore it needs a valid DeviceName attribute.
  Critical internal error. If this message appears in switchlog, it indicates a severe problem in the base monitor.
  Action: Contact field support.

• BAS 17: ERROR IN CONFIGURATION FILE: The rName is <rname>, its length length is larger than max length maxlength.
  The value of the rName attribute exceeds the maximum length of maxlength characters.
  Action: Specify shorter rName.

• BAS 18: ERROR IN CONFIGURATION FILE: The duplicate line number is <linenumber>.
  This message prints out the line number of the duplicate line in the hvgdstartup file.
  Action: Make sure that file hvgdstartup has no duplicate lines.

• BAS 19: ERROR IN CONFIGURATION FILE: The NoKindSpecifiedForGdet is <kind>, so no kind specified in hvgdstartup.
  The kind has not been specified for the generic detector in the hvgdstartup file.
  Act
181. eating and editing a configuration.

• Configuration Remove: Removes an existing high availability configuration.
• Configuration Freeze: Prevents further changes to a configuration. With this option, the configuration can be viewed but not modified. Configuration Freeze is password protected: you will be prompted to create a password before the configuration is locked.
• Configuration Thaw: Releases the configuration from the frozen state. Configuration Thaw is password protected: you must enter the correct password before the configuration is unlocked.
• Configuration Edit Global Settings: Modifies settings that affect the entire configuration. This includes settings for the detectors and the operation mode of the hvw command. This item is also used to specify the alternate interconnects (AlternateIps) for the cluster.
• Configuration Consistency Report: Provides a consistency check that verifies whether an application is running within a high availability configuration and has actually been created using the configuration data provided by the respective wizard. The wizard compares the currently activated wizard checksum against the wizard database checksum. One checksum is called the Live Info, the other is called the BuildInfo. If both checksums match for an application, it is certified that its running version conforms to what was configured by the wizard.
• Configuration ScriptExecution: Allows administrators to run an
182. eckScript
• OnlineScript
• OfflineScript

State-triggered scripts are as follows:

• PostOnlineScript
• PostOfflineScript
• OfflineDoneScript
• FaultScript
• WarningScript
• StateChangeScript

Post-online and post-offline scripts are generally state-triggered scripts. For example, if an online script executes successfully, RMS invokes the PostOnlineScript when the resource goes online. A similar situation is applicable for the PostOfflineScript.

Scripts are attributes of nodes. The use of scripts is always optional. The base monitor interprets unused script attributes, except the ShutdownScript, as scripts that terminate immediately and successfully, that is, as a script that contains only the line exit 0. If the ShutdownScript is not defined, then it is ignored.

All script types can be used with all nodes except for SysNodes, for which only a FaultScript and a ShutdownScript can be defined. Any changeover of a SysNode to the Online and Offline states is not subject to the control of RMS.

6.3 Initializing

After RMS starts, the initial state of all nodes is Unknown. RMS changes this state after the node has the necessary information for identifying the actual state. The following is necessary information for identifying the state:

• For nodes with a detector: first report of the detector
• For nodes with children: messages of the children concerning the
183. [Figure 106: Switching an application]

Caution: It is recommended that you use the normal mode of switching applications to ensure that application and data consistencies and integrity are maintained. If an application cannot be switched normally, you may use the forced switch mode; however, a forced switch overrides all safety checks and could even result in data corruption or other inconsistencies.

If the application is busy, the command pop-up will not offer the choices to switch the application. Instead, the command pop-up indicates that the application is busy and that you should try later (Figure 107).
184. ector will report offline or faulted.
  When different configurations are encountered in a cluster where one host is offline and the other is online.
  Action: Run the same configuration in a single cluster, or make sure that different clusters do not have common hosts.

• SYS 15: The uname system call returned with Error. RMS will be unable to verify the compliance of the RMS naming convention.
  This message appears when the uname system call returned with a non-zero value.
  Action: Make sure that the SysNode name is valid and restart RMS as needed.

• SYS 17: The RMS internal SysNode name sysnode is ambiguous with the name name. Please adjust names compliant with the RMS naming convention SysNode = `uname -n`RMS.
  The RMS naming convention _sysnodename_ = `uname -n`RMS is intended to allow use of the CF name with and without trailing RMS whenever an RMS command expects a SysNode reference. This rule creates an ambiguity if one SysNode is named xxxRMS and another is named xxx, because _rms_command_ xxx could refer to either SysNode. Therefore, ambiguous SysNode names are not allowed.
  Action: Use non-ambiguous SysNode names and adhere to the RMS naming conventions.

• SYS 48: Remote host <hostname> replied the checksum <remotechecksum> which is different from the local checksum <localchecksum>. The sysnode of this host will not be brought online.
  This message appears when the remote host <hostname> is r
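The ambiguity described for SYS 17 is easy to check for ahead of time. The following sketch is a hypothetical pre-check, not an RMS utility: it flags every pair of names where one is the other with a trailing RMS, which is the xxx / xxxRMS case described above.

```python
def find_ambiguous_sysnode_names(names, suffix="RMS"):
    """Return sorted (name, name + suffix) pairs that are both present.

    Hypothetical helper modelling the xxx / xxxRMS ambiguity from the text.
    """
    name_set = set(names)
    return sorted((n, n + suffix) for n in name_set if n + suffix in name_set)


print(find_ambiguous_sysnode_names(["fuji2", "fuji2RMS", "fuji3RMS"]))
print(find_ambiguous_sysnode_names(["fuji2RMS", "fuji3RMS"]))
```

An empty result means no name in the list can be confused with another one when the trailing RMS is omitted on a command line.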
185. ed at this stage.

During the activation phase, the wizard executes a series of tasks and displays the status on the screen. The completion of a task is indicated by the word done or a similar expression (see Figure 16).

About to activate the configuration mydemo ...
Testing for RMS to be up somewhere in the cluster ... done.
Arranging sub applications topologically ... done.
Check for all applications being consistent ... done.
Running overall consistency check ... done.
Generating pseudo code (one dot per sub application) ... done
Generating RMS resources ... done
hvbuild using /usr/opt/reliant/build/wizard.d/mydemo/mydemo.us
About to distribute the new configuration data to hosts: fuji2RMS, fuji3RMS
The new configuration was distributed successfully.
About to put the new configuration in effect ... done.
The activation has finished successfully.
Hit CR to continue

Figure 16: Activating a configuration

Among the tasks carried out by Configuration Activate are generation and distribution of the configuration. RMS performs a consistency check of the graph created in the generation of the configuration before distributing the configuration to all nodes in the cluster.

The test to see whether RMS is up on one of the nodes in the cluster is required since activation cannot be performed
186. ee: display the tree information of the resource database

15.10 RMS

System administration

hvassert: assert (test for) an RMS resource state
hvattr: make cluster-wide attribute changes at runtime from a single node (installed with PCS/the Wizard Tools)
hvcm: start the RMS configuration monitor
hvconfig: display or save the RMS configuration file
hvdisp: display RMS resource information
hvdist: distribute RMS configuration files
hvdump: collect debugging information about RMS
hvgdmake: compile an RMS custom detector
hvlogclean: clean RMS log files
hvrclev: change default RMS start run level
hvreset: reinitialize the graph of an RMS user application
hvshut: shut down RMS
hvswitch: switch control of an RMS user application resource to another node
hvthrottle: prevent multiple RMS scripts from running simultaneously
hvutil: manipulate availability of an RMS resource

File formats

hvenv.local: RMS local environment configuration file

15.11 RMS Wizards

RMS Wizard Tools and RMS Wizard Kit

RMS Wizards are documented as HTML pages in the SMAWRhvdo package on the CD-ROM. After installing this package, the documentation is available in the following directory:

<RELIANT_PATH>/htdocs/wizards.en (Solaris)
<RELIANT_PATH>/htdocs.linux/wizards.en (Linux)

The default value of <RELIANT_PATH> is /opt/SMAW/SMAWRrms.
187. em conditions, including monitoring resources that have no Offline state. Physical disks are an example of such nodes because they are monitored but cannot be deconfigured. For this purpose, RMS provides the attribute LieOffline to indicate that the resource has no Offline state. This attribute is set by default for physical disks (node type disk) and does not have to be explicitly specified.

During offline processing, a node identified with LieOffline reacts in the same way as any other node, in particular when all pre, post, and offline scripts are run. The reaction of the node with respect to its parent is also the same as if the node had been successfully deconfigured; that is, it "lies". A node with LieOffline set does not wait for an offline report of the detector after the offline script has executed; it automatically executes the post-offline script. An unexpected online report of the detector which arrives after the offline script has executed is not a fault condition in this case.

6.6 Fault processing

The handling of fault situations is a central aspect of RMS. How RMS reacts to faults differs depending on the state of an application at any particular time. For instance, the reaction to faults that occur in the resource graph of an ongoing application differs from the reaction to faults in the graph of an application that is locally offline.

6.6.1 Faults in the online state or request processing

When a detector indicat
188. en the user has elected to proceed with the command hvshut -f -a, this message is printed to confirm the choice.
  Action: None required.

• NOTICE: User has been warned of hvshut -L and has elected to proceed.
  When the user invokes hvshut -L and then has elected to proceed with the command, this message is printed to confirm that hvshut -L is being invoked.
  Action: None required.

• RELIANT_LOG_PATH is not defined
  When the hvlogclean utility is invoked without the -d option, it needs the value of the environment variable RELIANT_LOG_PATH to get to the hvloginit script. If the value of the variable cannot be found, this message is the result and the utility exits with exit code 6.
  Action: Make sure that the environment variable RELIANT_LOG_PATH has not been unset and is set to the appropriate value.

• RELIANT_PATH is not defined
  When the hvlogclean utility is invoked without the -d option, it needs the value of the environment variable RELIANT_PATH to get to the hvloginit script. If the value of the variable cannot be found, this message is the result and the utility exits with exit code 6.
  Action: Make sure that the environment variable RELIANT_PATH is set to the appropriate value.

• Remote host <hostname> is not Online
  When performing hvassert, if the remote host hostname is not Online, this mess
189. er's NullDetector attribute is set to off.
  Action: The controlled resource must be present in the configuration for the controller to work properly.

• CTL 2: Controller <controller> detected more than one controlled application Online. This has lead to the controller fault. Therefore, all the online controlled application will now be switched offline.
  If the controller controller has two or more of the controlled applications Online on one or more hosts, then the controller faults.
  Action: Make sure that more than one controlled application for a controller is not Online.

8.8 CUP: userApplication contracts

• CUP 2: object: cluster is in inconsistent condition, current online host conflict, received: host, local: onlinenode.
  The cluster hosts are unable to reach an agreement as to which host is responsible for a particular userApplication. The most likely reason for this is an erroneous system administrator intervention (e.g. a forced hvswitch request), so that the userApplication is Online on more than one host simultaneously.
  Action: Analyze the cluster inconsistency and perform the appropriate action to resolve it. If the application is online on more than one host, shut down (hvutil -f) the userApplication on all but one host.

• CUP 3: object is already waiting for an event, cannot set timer.
  Critical internal error.
  Action: Co
190. errno.h.
  Action: Depends on the exitcode value.

• hvutil: Could not determine IP address of <targethost>
  The name of the cluster host could not be resolved to an IP address.
  Action: Add an entry for targethost into the /etc/hosts file of all cluster hosts.

• hvutil: debug option must be a positive number for on, 0 for off.
  When hvutil -L has been invoked with a loglevel that is not one of 0 or 1, this message is the result and it exits with exit code 6.
  Action: Specify a valid logging level of 0 or 1 for the utility.

• hvutil: Detector time period must be greater than minimumtime.
  If the detector time period specified as an argument with hvutil -t is less than minimumtime, hvutil is aborted and exits with exit code 5.
  Action: Invoke hvutil with a time period that is greater than minimumtime.

• hvutil: Failed to allocate socket
  Failed to allocate a socket to communicate with a remote host.
  Action: Contact professional services to determine the cause.

• hvutil: Missing /etc/services entry for rmshb
  An entry is missing in the /etc/services file for the RMS heartbeat.
  Action: Add an entry on all cluster hosts for rmshb using tcp.

• hvutil: Notify string is longer than mesglen bytes
  Notify string is too long.
  Action: Notify string should not be longer than mesglen bytes.
191. erties or attributes which limit or define what monitoring or action can occur. When a resource is associated with a particular object type, attributes associated with that object type are applied to the resource.
  See also generic type (RMS).

online maintenance
  The capability of adding, removing, replacing, or recovering devices without shutting or powering off the node.

operating system dependent (CF)
  This module provides an interface between the native operating system and the abstract, OS-independent interface that all PRIMECLUSTER modules depend upon.

Oracle Real Application Clusters (RAC)
  Oracle RAC allows access to all data in a database to users and applications in a clustered or MPP (massively parallel processing) platform. Formerly known as Oracle Parallel Server (OPS).

OSD (CF)
  See operating system dependent (CF).

parent (RMS)
  An object in the configuration file or system graph that has at least one child.
  See also child (RMS), configuration file (RMS), system graph (RMS).

PCS
  See PRIMECLUSTER Configuration Services (PCS).

primary node (RMS)
  The default node on which a user application comes online when RMS is started. This is always the nodename of the first child listed in the userApplication object definition.

PRIMECLUSTER Configuration Services (PCS)
  The graphical configuration interface for PRIMECLUSTER products. PCS uses standard templates written in Configurat
192. [Figure 24: Consistency check and Machines+Basics menu. The excerpt shows part of the wizard's text menu, with entries such as PreCheckScript, PreOnlineScript, PostOnlineScript, PreOfflineScript, OfflineDoneScript, FaultScript, StandbyTransitions, LicenseToKill=no, AutoBreak=yes, HaltFlag=no, PartialCluster=0, and ScriptTimeout, followed by the prompt "Choose the setting to process: 5".]

At the top of the menu, the wizard shows you the result of the latest consistency check. The application named APP1, which was indicated on the previous screen, has proven to be consistent.

The Machines[0] menu item indicates the node where your application will first attempt to come online. In this case, it is fuji2RMS. The RMS Wizards retrieve the default settings for Machines[0] from the local node defined in RELIANT_HOSTNAME.

Subsequent Machines items, if any, indicate the list of failover nodes. If the initial node fails, RMS will attempt to switch the application to a failover node, trying each one in the list according to the index order.

At this point, only the initial node appears in the menu, so configure a failover node for your application as follows:

> Select AdditionalMachine by entering the number 5.

A menu containing the current list of available nodes appears (Figure 25).

1) HELP
2) RETURN
3) fuji2RMS
4) fuji3RMS
Choose a machine for this appli
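The index-order selection of a failover node can be sketched as below. This is only a model of the ordering rule described above, under the assumption that the first list entry is the initial node and the rest are failover candidates; the function name and the node fuji4RMS are hypothetical.

```python
def choose_failover_node(machines, available):
    """Pick the first failover candidate, in index order, that is available.

    machines:  node names in index order; machines[0] is the initial node
    available: collection of node names currently able to take the application
    Returns None when no failover node is available. Hypothetical helper.
    """
    for node in machines[1:]:
        if node in available:
            return node
    return None


machines = ["fuji2RMS", "fuji3RMS", "fuji4RMS"]  # fuji4RMS is made up for the example
print(choose_failover_node(machines, {"fuji3RMS", "fuji4RMS"}))
print(choose_failover_node(machines, {"fuji4RMS"}))
```

The order in which nodes are entered in the wizard therefore determines which node is tried first after the initial node fails.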
193. es a fault for an online node whose corresponding userApplication is also online, RMS executes the fault script of the node. An equivalent fault condition occurs if the detector indicates that a previously online node is offline although no request is present.

After the fault script completes, RMS notifies the parents of the fault. The parents also execute their fault scripts and forward the fault message.

A special case is represented by OR nodes. These react to the fault message only if no other child is online. If another child of the parent is online, RMS terminates the fault processing at this point.

If there is no intermediate OR node that intercepts the fault message, it reaches the userApplication. The userApplication then executes its fault script. There are four subsequent cases possible during processing, depending on the AutoSwitchOver and PreserveState attributes. These attributes are set for the userApplication in the configuration file according to the needs of the application. These fault processing combinations are as follows:

• AutoSwitchOver is set
• PreserveState is set, but AutoSwitchOver is not set
• Neither AutoSwitchOver nor PreserveState is set
• Both AutoSwitchOver and PreserveState are set

AutoSwitchOver only

If the AutoSwitchOver attribute is set for the userApplication, the process is as follows:

1. The userApplication attempts to initiate the switchover procedure. For this purpose, the
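The upward propagation of a fault, including the special role of OR nodes, can be modelled with a small tree walk. The class and function below are an illustrative sketch under simplifying assumptions (each node has a single parent, and running a fault script is represented by recording the node name); they are not RMS data structures.

```python
class Node:
    """Minimal graph node for illustration; not an RMS object type."""

    def __init__(self, name, is_or=False):
        self.name = name
        self.is_or = is_or
        self.state = "Online"
        self.parent = None
        self.children = []

    def add(self, child):
        child.parent = self
        self.children.append(child)
        return child


def propagate_fault(node):
    """Mark nodes Faulted from the given node upward.

    Returns the names of the nodes whose fault script would run, in order.
    Propagation stops at an OR parent that still has an Online child.
    """
    ran = []
    current = node
    while current is not None:
        current.state = "Faulted"
        ran.append(current.name)
        parent = current.parent
        if parent is None:
            break
        if parent.is_or and any(c.state == "Online" for c in parent.children):
            break  # the OR node intercepts the fault message
        current = parent
    return ran


app = Node("app1")
either = app.add(Node("orNode", is_or=True))
a = either.add(Node("resA"))
b = either.add(Node("resB"))
print(propagate_fault(a))  # resB is still Online, so the OR node absorbs the fault
print(propagate_fault(b))  # no Online child remains, so the fault reaches app1
```

What happens after the fault reaches the userApplication depends on the attribute combinations listed above and is outside this sketch.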
194. esource failing to come Offline after running its OfflineScript offlinescript.
  After a resource executes its offline script, it is expected to come Offline. If it does not change its state, or transitions to a state other than Offline, within the period of seconds specified by its ScriptTimeout attribute, the resource is considered as being Faulted.
  Action: Make sure the Offline script moves the resource into Offline state.

• DET 6: FAULT REASON: Resource <resource> transitioned to a Faulted state due to the resource failing to come Online after running its OnlineScript onlinescript.
  After a resource executes its online script, it is expected to come Online. If it does not change its state, or transitions to a state other than Online, within the period of seconds specified by its ScriptTimeout attribute, the resource is considered as being Faulted.
  Action: Make sure the Online script moves the resource into Online state.

• DET 7: FAULT REASON: Resource <resource> transitioned to a Faulted state due to the resource unexpectedly becoming Offline.
  This message appears when the resource becomes Offline unexpectedly.
  Action: Check to see why the resource suddenly transitioned to the Offline state.

• DET 11: DETECTOR STARTUP FAILED: Corrupted command line <commandline>.
  Critical internal error. This message occurs when the command
195. esource from one state to another, for example, from Offline to Online. The two types of scripts are as follows:
- Request-triggered scripts initiate a state change to a resource. The request-triggered scripts are as follows:
  InitScript: Runs only once when RMS is first started
  PreCheckScript: Determines if Online or Standby processing is needed or possible
  PreOfflineScript: Prepares a transition to an Offline state
- State-triggered scripts react to specific events. The state-triggered scripts are as follows:
  OfflineScript: Transitions a resource to an Offline state
  PreOnlineScript: Prepares a transition to an Online state
  OnlineScript: Transitions a resource to an Online state
  PostOnlineScript: Reaction to the transition to the Online state
  PostOfflineScript: Reaction to the transition to the Offline state
  OfflineDoneScript: Reaction to a userApplication reaching the Offline state
  FaultScript: Reaction to a resource transitioning to the Faulted state
  WarningScript: Reaction to a detector reporting the Warning state
  StateChangeScript: Reaction to a scalable controller's userApplication or SysNode changing state
Scripts for common system functions are included with the subapplications provided by the Wizard Tools.
2.6.4 RMS CLI
The primary interface for configuring RMS is the RMS Wizards and the primary interface for a
196. exceed maxhosts
- ADM 107: The cumulative length of the SysNode names specified in the configuration for the userApplication <appname> is length. This exceeds the maximum allowable length which is maxlength.
The cumulative length of the SysNode names specified in the configuration for application appname exceeds the maximum allowable limit.
Action: Limit the length of the SysNode names so that they fit within the maximum allowable limit.
8.3 BAS: Startup and configuration errors
- BAS 2: Duplicate line in hvgdstartup.
If RMS detects that a line has been duplicated in the hvgdstartup file, it prints this error message and exits with exit code 23.
Action: Only unique lines are allowed in hvgdstartup. Remove all the duplicate entries.
- BAS 3: No kind specified in hvgdstartup.
In the hvgdstartup file, the entry for the detector is not of the form gN -t<n> -k<n>, or the -k<n> option is missing. Since RMS is unable to start, it exits with exit code 23.
Action: Modify the entry for the detector so that the kind (-k<n>) option for the detector is specified properly.
- BAS 6: DetectorStartScript for kind <kind> cannot be redefined while detector is running.
During dynamic modification, if there is an attempt to redefine the kind for the DetectorStartScrip
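Before starting RMS, the BAS 2 condition (a repeated line in hvgdstartup) can be checked ahead of time. The following is a minimal sketch under stated assumptions: the function name find_duplicate_lines is hypothetical, blank lines are ignored, and lines are compared after trimming surrounding whitespace. How RMS itself compares lines is not stated in this excerpt.

```python
def find_duplicate_lines(lines):
    """Return (line_number, text) for every line that repeats an earlier one.

    Assumptions (not taken from the manual): blank lines are skipped and
    entries are compared after stripping leading and trailing whitespace.
    """
    seen = set()
    duplicates = []
    for number, raw in enumerate(lines, start=1):
        entry = raw.strip()
        if not entry:
            continue                      # ignore empty lines
        if entry in seen:
            duplicates.append((number, entry))
        else:
            seen.add(entry)
    return duplicates
```

To check a real file, pass the result of reading it line by line, for example find_duplicate_lines(open(path).read().splitlines()), where path is wherever hvgdstartup lives on the system in question.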
197. f available machines (Figure 50).
 1) HELP
 2) RETURN
 3) fuji2RMS
 4) fuji3RMS
Choose a machine for this application: 4
Figure 50: List of nodes for failover procedure
As with the former application, the additional machine to be specified for the failover procedure is fuji3RMS.
> Select fuji3RMS by entering the number 4.
In the screen that follows, you see your selection confirmed (Figure 51). The 8) Machines[1] item now displays fuji3RMS as the additional machine. APP2 will be switched over to this machine if fuji2RMS fails.
Consistency check ...
Machines+Basics (app2: consistent)
 1) HELP
 3) SAVE+EXIT
 4) REMOVE+EXIT
 5) AdditionalMachine
 6) AdditionalConsole
 7) Machines[0]=fuji2RMS
 8) Machines[1]=fuji3RMS
 9) PreCheckScript
10) PreOnlineScript
11) PostOnlineScript
12) PreOfflineScript
13) OfflineDoneScript
14) FaultScript
15) AutoStartUp=no
16) AutoSwitchOver=No
17) PreserveState=no
18) PersistentFault=0
19) ShutdownPriority
20) OnlinePriority
21) StandbyTransitions
22) LicenseToKill=no
23) AutoBreak=yes
24) HaltFlag=no
25) PartialCluster=0
26) ScriptTimeout
Choose the setting to process: 3
Figure 51: Machines+Basics menu
Save your settings and exit this part of the configuration procedure.
> Select SAVE+EXIT by entering the number 3. This takes you to
198. f controlled application
Figure 4: Follow mode switchover
Figure 5: Scalable mode, controlled (child) application switchover
Figure 6: Scalable mode, controlling (parent) application switchover
Figure 7: Relationship between RMS and RMS Wizards
Figure 8: NFS Lock Failover screen
Figure 9: Main configuration menu when RMS is not active
Figure 10: Main configuration menu when RMS is running
Figure 11: Application type selection
Figure 12: Menu leading to basic settings
Figure 13: Menu to configure basic settings
Figure 14: Menu to configure non-basic settings
Figure 15: Main configuration menu
Figure 16: Activating a configuration
Figure 17: Quitting the Main configuration menu
Figure 18: Main configuration menu
Figure 19: Add hosts to a cluster menu
Figure 20: Remove hosts from a cluster menu
Figure 21: Main configuration menu
Figure 22: Application type selection menu
Figure 23: Prompting for further actions
Figure 24: Consistency check and Machines+Basics menu
199. f it becomes faulted. The values can be combined using the vertical bar (|) character. The No value cannot be combined with any other value. For backward compatibility, the numeric values 0 and 1 are accepted: 0 is equivalent to No, and 1 is equivalent to HostFailure|ResourceFailure.
- ClusterExclusive
Possible Values: 0, 1
Default: 0
Valid for resource objects. If set to 1, guarantees that the resource is Online on only one node in the cluster at any time. If set to 0, allows a resource to be Online on more than one node at a time. The user can modify this attribute for a cmdline subapplication only. The configuration tools control this attribute for all other subapplications.
- FaultScript
Possible Values: Valid script (character)
Default: empty
Valid for all object types. Specifies a script to be run if the associated resource enters the Faulted state.
- Follow
Possible Values: 0, 1
Default: 0
Valid for controller objects. Specifies whether or not the object is a Follow controller. The user changes this attribute indirectly by selecting the controller type in the configuration interface.
If set to 1, the controller operates in Follow mode. When the parent application is switched Online, all child applications also come Online on the same node regardless of the order specified in their respective Prior
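The combination rules for this attribute value (vertical bar as separator, No not combinable, 0 and 1 accepted for backward compatibility) can be sketched as a small parser. This is an illustration only: the function name parse_auto_switch_over is hypothetical, and the set of accepted names is an assumption assembled from values that appear in this document (No, HostFailure, ResourceFailure, and the ShutDown option mentioned in a base monitor message).

```python
# Assumed set of value names; only those seen elsewhere in this document.
ALLOWED = {"No", "HostFailure", "ResourceFailure", "ShutDown"}


def parse_auto_switch_over(value):
    """Normalize an attribute value into a frozenset of option names."""
    text = str(value).strip()
    # Backward-compatible numeric forms described in the attribute text.
    if text == "0":
        return frozenset({"No"})
    if text == "1":
        return frozenset({"HostFailure", "ResourceFailure"})
    parts = [part.strip() for part in text.split("|")]
    unknown = [part for part in parts if part not in ALLOWED]
    if unknown:
        raise ValueError(f"unknown value(s): {unknown}")
    # "No" cannot be combined with any other value.
    if "No" in parts and len(set(parts)) > 1:
        raise ValueError("No cannot be combined with other values")
    return frozenset(parts)
```

Returning a frozenset keeps the result order-independent, so HostFailure|ResourceFailure and the legacy value 1 compare equal.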
200. face.
See also Cluster Foundation (CF), Reliant Monitor Services (RMS), PRIMECLUSTER Configuration Services (PCS), Scalable Internet Services (SIS), Web-Based Admin View.
Cluster Configuration Backup and Restore (CCBR)
Provides a simple method to save the current PRIMECLUSTER configuration information of a cluster node. It also provides a method to restore the configuration information.
Cluster Foundation (CF)
The set of PRIMECLUSTER modules that provides basic clustering communication services.
See also base cluster foundation (CF).
cluster interconnect (CF)
The set of private network connections used exclusively for PRIMECLUSTER communications.
Cluster Join Services (CF)
This PRIMECLUSTER module handles the forming of a new cluster and the addition of nodes.
concatenated virtual disk (RCVM)
Concatenated virtual disks consist of two or more pieces on one or more disk drives. They correspond to the sum of their parts. Unlike simple virtual disks, where the disk is subdivided into small pieces, the individual disks or partitions are combined to form a single large logical disk.
See also mirror virtual disk (RCVM), simple virtual disk (RCVM), striped virtual disk (RCVM), virtual disk.
Configuration Definition Language (PCS)
The syntax for PCS configuration templates.
See also PRIMECLUSTER Configuration Services (PCS).
configuration file (RMS)
The RMS configuration file that defines
201. fflineScript
Possible Values: Valid script (character)
Default: empty
Valid for all objects except SysNode objects. Specifies the script to be run after the state of the associated resource changes to Offline.
- PostOnlineScript
Possible Values: Valid script (character)
Default: empty
Valid for all objects except SysNode objects. Specifies the script to be run after the state of the associated resource changes to Online.
- PreOfflineScript
Possible Values: Valid script (character)
Default: empty
Valid for all objects except SysNode objects. Specifies the script to run before the object is taken to the Offline state.
- PreOnlineScript
Possible Values: Valid script (character)
Default: empty
Valid for all objects except SysNode objects. Specifies the script to be run before the associated resource is taken to the Online state.
- PreserveState
Possible Values: 0, 1
Default: 0
Valid for userApplication objects. Specifies that resources are not to be taken Offline after a fault. Ignored if AutoSwitchOver is not set to No.
- PriorityList
Possible Values: Valid list of SysNode names (character)
Default: empty
Valid for userApplication objects. Contains a list of SysNode objects where the application can come Online. The order in the list determines the next node to which the application is switc
202. figuration procedure on page 40 outlines the four major steps involved in every configuration procedure.
- The section "Creating and editing a configuration" on page 40 describes the wizard interface and how it is used to specify a configuration.
- The section "Activating a configuration" on page 49 describes how to activate a configuration after it has been created or modified.
- The section "Configuration elements" on page 53 provides additional details about basic RMS elements specified in every configuration.
- The section "Further reading" on page 55 contains a list of related documents that provide additional information about the wizards.
All the following procedures assume the Cluster Foundation (CF) software has been properly installed, configured, and started. See the Cluster Foundation (CF) Configuration and Administration Guide for details.
3.1 Overview
The chapter "Introduction" on page 9 describes the components necessary for configuring applications for high availability. It is extremely important that you define applications and the resources that are used by them. Resources are entities like disks, file systems, processes, IP addresses, and so forth. This definition also needs to include the following information:
- How the applications and their resources are related to each other
- What scripts bring resources Online and Offline
- Which detectors monitor the state of which resources
203. for backward compatibility only, and support for it may be withdrawn in future releases. Therefore, it is recommended that only the attribute DetectorStartScript be used for setting new configurations. The attribute DetectorStartScript and the file hvgdstartup are mutually exclusive.
Action: Use DetectorStartScript for setting new configurations, as support for hvgdstartup may be discontinued in future releases.
- BM 75 88: Dynamic modification failed: controller <controller> has its attributes SplitRequest, IgnoreOnlineRequest, and IgnoreOfflineRequest set to 1. If SplitRequest is set to 1, then at least one of IgnoreOfflineRequest or IgnoreOnlineRequest must be set to 0.
An invalid combination of controller attributes was encountered. If both IgnoreOfflineRequest and IgnoreOnlineRequest are set to 1, then no request will be propagated to the controlled application(s), so no request can be split.
Action: Provide a valid combination of the controller attributes.
- BM 80 92: Dynamic modification failed: controller <controller> belongs to the application <application> which AutoSwitchOver attribute has ShutDown option set, but its controlled application <controlled> has not.
If a controlling application has its AutoSwitchOver attribute set with the option ShutDown, then all applications controlled by the cont
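The attribute rule behind the first of these two messages is small enough to state as a predicate. A minimal sketch, with a hypothetical function name; the three parameters simply mirror the attribute names in the message text and take the values 0 or 1.

```python
def split_request_combination_is_valid(split_request,
                                       ignore_online_request,
                                       ignore_offline_request):
    """Check the rule quoted in the message text.

    If SplitRequest is 1, at least one of IgnoreOnlineRequest and
    IgnoreOfflineRequest must be 0; otherwise no request would be
    propagated to the controlled application, so none could be split.
    """
    if split_request == 1:
        return ignore_online_request == 0 or ignore_offline_request == 0
    return True  # the rule only constrains controllers with SplitRequest=1
```

Only the combination 1, 1, 1 is rejected by this particular rule; other constraints on controller attributes are outside the scope of the sketch.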
204. ful in understanding RMS. Every object is an independent instance that carries out actions (typically implemented by shell scripts) according to rules based on its state and messages received from detectors or other objects. States, detectors, and scripts were introduced in the chapter "Introduction" on page 9. The following sections provide more details about RMS internal structure and inter-object communication.
6.1.1 Configuration structure
The following rules apply to RMS configurations:
- There must be a SysNode object for every node (host) in the cluster.
- A userApplication object is a child of every SysNode on which it may run. Therefore, a userApplication has multiple SysNode parents.
- userApplication objects have one child for each SysNode on which they may run. That child is usually an andOp object type and must have its HostName attribute set to the SysNode name to which it refers. By default, the configuration wizards generate a name of the form <application_name>_Host_<hostname> for each of these andOp objects.
- Each SysNode and userApplication object can appear only once in the graph.
- Every instance of an object can only be used once in a configuration.
- Objects that belong to different userApplication objects cannot depend on each other.
- A leaf object must always have a detector.
- There must not be
205. g that operator intervention is required. After determining that operator intervention is required, the operator must perform the following:
1. Manually shut down the cluster host indicated by the SysNode in the Wait state.
2. Issue the hvutil -u SysNode_name command on a surviving cluster host.
6.7 Switch processing
The switch processing procedure ensures that an application switches over to another host in the cluster.
6.7.1 Switch request
Switch requests are divided as follows:
- Priority switch request: RMS identifies the target host according to the host priority as defined in the configuration (see the description of the PriorityList attribute in the chapter "Appendix: Attributes" on page 325).
- Directed switch request: The user specifies the target host.
The types of switches are divided as follows:
- Switchover: The application running on a host is to be switched over to another host.
- Switch online: An application that is not running on any host is started, or the host on which it has previously been running has failed.
In switch processing, RMS performs the activities in Table 6, depending on the switch scenario. By means of the command interface, both priority and directed switch requests are possible.
Table 6 (columns: Activity | Switchover | Switch online)
The userApplication generates a switch request | X | X
when
206. g and administering the multipath function for Global Link Services (GLS).
- Data Management Tools (Solaris) Configuration and Administration Guide: Provides reference information on the Volume Manager (RCVM) and File Share (RCFS) products.
- SNMP Reference Manual (Solaris, Linux): Provides reference information on the Simple Network Management Protocol (SNMP) product.
- Release notices for all products: These documentation files are included as HTML files on the PRIMECLUSTER Framework CD. Release notices provide late-breaking information about installation, configuration, and operations for PRIMECLUSTER. Read this information first.
- RMS Wizards documentation package: Available on the PRIMECLUSTER CD. These documents deal with topics such as the configuration of file systems and IP addresses. They also describe the different kinds of wizards.
Suggested documentation
The following manuals contain information relevant to RMS administration and can be ordered through your sales representative (not available in all areas):
- ANSI C Programmer's Guide
- LAN Console Installation, Operation and Maintenance
- Terminal TM100/TM10 Operating Manual
- PRIMEPOWER User's Manual (operating manual)
Your sales representative will need your operating system release and product version to place your order.
1.3 Conventions
To standardize the presentation of material, this manual use
207. g problem.
This occurs if the checksum reported by the remote host is different from that of the local host, and the configuration for the local host does not include the remote host's name, but the configuration for the remote host hostname includes the local host.
Action: Make sure that the local and the remote host are running the same configuration.
- ADC 5: Since this host <hostname> has been online for more than time seconds, and due to the previous error, it will remain online, but neither automatic nor manual switchover will be possible on this host until <detector> detector will report offline or faulted.
RMS prints this message if the checksums of the configurations of the local and the remote host are different and more than time seconds have elapsed since this host went online. time is the value of the environment variable HV_CHECKSUM_INTERVAL if it is set, or 120 seconds if it is not.
Action: Make sure that all the hosts in the cluster are running the same configuration file.
- ADC 15: Global environment variable <envattribute> is not set in hvenv file.
This message is the result of RMS being unable to set the global environment variable <envattribute> because it has not been set in hvenv. envattribute can be any one of the following:
RELIANT_LOG_LIFE
RELIANT_SHUT_MIN_WAIT
HV_CHECKSUM_INTERVAL
HV_
208. gure 38).
Global settings: AlternateIps for fuji2RMS
 1) HELP
 2) RETURN
 3) FREECHOICE
 4) fuji2rmsAI01
 5) fuji2rmsAI02
Choose the RMS IpAlias:
Figure 38: Global settings: AlternateIps second menu
> Select 4) fuji2rmsAI01.
The Global settings: AlternateIps first menu for fuji2RMS appears (Figure 39).
Global settings: AlternateIps for fuji2RMS
 1) HELP
 2) NO-SAVE
 3) SAVE
 4) NONE
 5) AdditionalAlternateIps
 6) IpAliasForM 0 fuji2rmsAI01
Choose the RMS IpAlias to process:
Figure 39: Global settings: AlternateIps first menu with first interface
Repeat the previous two steps, but this time choose 5) fuji2rmsAI02. The Global settings: AlternateIps first menu for fuji2RMS will then appear with both AlternateIps (Figure 40).
Global settings: AlternateIps for fuji2RMS
 1) HELP
 2) NO-SAVE
 3) SAVE
 4) NONE
 5) AdditionalAlternateIps
 6) IpAliasForM 0 fuji2rmsAI01
 7) IpAliasForM 1 fuji2rmsAI02
Choose the RMS IpAlias to process:
Figure 40: Global settings: AlternateIps first menu with both interfaces
> Select 3) SAVE.
This will save the list of AlternateIps for fuji2RMS and return you to the Global settings main menu, which has been updated with the new information (Figure 39).
Global settings (main menu: consistent)
 1) HELP
 2) NO-SAVE+EXIT
 3) SAVE+EXIT
 4) ShowTurnkeyWi
209. hat no corresponding application goes online on one of these hosts. A primary objective of RMS is to ensure that no data losses occur as a result of simultaneous activity of an application on several hosts.
It can be extremely damaging if a userApplication is online on more than one host directly after RMS has initialized. In this case, RMS generates a FATAL ERROR message and blocks any further requests for the userApplication. This minimizes the possibility of damage caused by inconsistency in the cluster.
6.5 Offline processing
Normally, offline processing results in the userApplication transitioning to the Offline state.
6.5.1 Offline request
In normal operating mode, only the RMS command interface can generate an offline request. In the case of a fault (for example, if one or more necessary resources fail), the userApplication generates its own offline request. This prevents an application that is no longer operating correctly from continuing to operate in an uncontrolled manner (see also the section "Fault processing" on page 159). This offline request is also a primary precondition for any subsequent switchover.
6.5.2 Offline processing in a logical graph of a userApplication
Unlike online processing, the direction of offline processing is from the userApplication to the leaf node (top-down). Nodes without a detector execute the post offline scr
210. he offline process completes, the userApplication notifies the corresponding userApplication nodes on the other hosts that the application has gone offline.
In the case that the hvshut command is used, RMS initiates offline processing and the userApplication checks the state of other userApplication nodes on the local host. RMS is then terminated if all of these local userApplication nodes are offline.
6.5.3 Fault situations during offline processing
The section "Fault processing" on page 159 describes the processing of any faults that occur during offline processing. The following can cause faults during offline processing:
- Detector indicates the Faulted state
- Detector signals the Online state for a node that was reported as Offline
- Script fails with an exit status other than 0
- Script fails with a timeout
- Node is not detected by the detector as being Offline within a specific period after the offline script completes
- Child of a node indicates a fault
6.5.4 Node is already offline
If a node is already offline at the start of offline processing (a situation which can occur only in nodes below an OR node), the request is merely passed through. As in online processing, scripts are not executed and the Wait state is not entered.
6.5.5 Node does not have an Offline state
RMS covers an extremely wide range of syst
211. he Detach button. Use the Attach button to attach it again. Figure 116 shows a detached log.
[Figure 116: Detached log. Log viewer window for /var/opt/reliant/log/app2.log on fuji2RMS with Start Time, End Time, Resource Name, Severity, Non-zero exit code, and Keyword filter fields]
7.4.1 Search based on resource
Searches based on the name of the resource apply only to application logs. Search the log files based on the name of a resource as follows:
1. Select the name of the resource from the pull-down list.
2. Press the Filter button.
Figure 117 shows the window for a search based on the resource name.
[Figure 117: Log viewer window for /var/opt/reliant/log/app2.log on fuji2RMS with the Resource Name pull-down list open]
212. he Wizard Tools interface
3.6 Configuration elements
This section discusses some basic elements that are part of a high availability configuration. Most of them have been mentioned in previous sections. Additional details are provided here to assist you in understanding how they are used by the wizards.
Users do not have to deal with any of the items listed in this section directly. RMS Wizards manage all the basic elements for a high availability configuration. This section is provided only to help users better understand the configuration elements.
3.6.1 Scripts
Scripts are used in a high availability configuration to perform several kinds of actions. Among the most important types of actions are the following:
- Bringing a resource to an Online state
- Bringing a resource to an Offline state
As an example of a script sending a resource Offline, you might think of a file system that has to be unmounted on a node where a fault occurs. An offline script would use the umount command to unmount the file system. Another script might use the mount command to mount it on a different node.
Besides such online and offline scripts, there are also pre-online and pre-offline scripts for preparing transition into the respective states, as well as a number of other scripts.
The RMS Wizards provide a complete set of scripts for several pre-defined application types, such as R/3 or Oracle. If you assign your applicat
213. he online processing of another userApplication.
Normally, this process results in the userApplication transitioning to the Online state. The following situations can prevent the userApplication from transitioning to the Online state:
- PreCheckScript determines that the userApplication should not come online
- Fault occurs during online processing
These situations are discussed in detail in later sections.
6.4.1 Online request
Generating the online request is referred to as switching the userApplication, that is, switching the userApplication online or switching the userApplication to another cluster node (refer also to the section "Switch processing" on page 165). The following actions can generate an online request:
- Manual request using the GUI
- Manual request using the CLI
- Automatic request at RMS startup
- Automatic request when a fault occurs
6.4.1.1 Manual methods
Both manual methods have two modes for switching the userApplication. These modes are as follows:
- Priority switch: RMS selects the SysNode. The userApplication is switched to the highest priority SysNode. The order of the children in the userApplication node determines the SysNode priorities.
- Directed switch: The user selects the SysNode. The userApplication is switched to a specific SysNode.
In both priority and directed switches, only SysNodes that are in the Online state may be selected.
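The difference between the two manual modes can be sketched as a target selection function. This is an illustration under stated assumptions, not the RMS implementation: the function name choose_switch_target and its parameters are hypothetical, and the priority order is passed in as a plain list of SysNode names, highest priority first.

```python
def choose_switch_target(priority_order, online_nodes, directed=None):
    """Pick the SysNode an application would be switched to.

    priority_order: SysNode names, highest priority first.
    online_nodes:   names of SysNodes currently in the Online state.
    directed:       a SysNode chosen by the user, or None for a priority switch.
    Returns the chosen name, or None if no Online SysNode is available.
    """
    if directed is not None:
        # Directed switch: the user names the target, which must be Online.
        if directed not in online_nodes:
            raise ValueError(f"{directed} is not in the Online state")
        return directed
    # Priority switch: first Online SysNode in priority order.
    for name in priority_order:
        if name in online_nodes:
            return name
    return None
```

For example, with the order fuji2RMS, fuji3RMS and only fuji3RMS Online, a priority switch selects fuji3RMS, while a directed switch to fuji2RMS is rejected.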
214. he switchlog file, as well as the application-specific log file and any appropriate detector log files, may all need to be viewed and interpreted.
7.9.1 RMS Wizards detector logging
The RMS Wizard detectors log information to both the switchlog file and to their own detector log file hvdet_xxx.gnnlog (for example hvdet_icmp g964109). All resource state changes are logged both to the switchlog file and to the detector's own log file. Other detector messages are not logged to the switchlog file. A detector log file is created for each running instance of a detector.
Each detector maintains an internal 10 KB memory buffer for logging debugging messages, which are printed to the log file when an unexpected resource status report occurs. The buffer is circular: if it fills before anything is printed out, it is reused from the beginning, and any existing data in the buffer is overwritten and lost.
Each internal log message in the detector has an associated logging level. Only those messages whose level is less than or equal to the current log level setting are added to the internal circular buffer. By default, only the internal messages marked with a debugging level of 1 are inserted into the buffer. The greater the value, the more debugging information is printed; however, the contents of logs may vary from detector to detector.
The valid range of values is 1 to 9; the default value is 1. This can be modif
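The buffering behavior described here (a fixed-size circular buffer, filtered by log level, dumped when something unexpected happens) can be sketched as follows. This is a simplified illustration, not the detector's actual data structure: the class name DetectorLogBuffer and its methods are hypothetical, and eviction is done per whole message rather than per byte.

```python
from collections import deque


class DetectorLogBuffer:
    """Size-limited buffer that keeps only the most recent messages."""

    def __init__(self, capacity_bytes=10 * 1024, log_level=1):
        self.capacity_bytes = capacity_bytes   # 10 KB, as in the text
        self.log_level = log_level             # valid range in the text: 1 to 9
        self._messages = deque()
        self._used = 0

    def add(self, level, text):
        """Store a message if its level is at or below the current log level."""
        if level > self.log_level:
            return False
        size = len(text.encode("utf-8"))
        if size > self.capacity_bytes:
            return False                       # sketch choice: drop oversized messages
        # Overwrite the oldest entries until the new message fits.
        while self._used + size > self.capacity_bytes:
            oldest = self._messages.popleft()
            self._used -= len(oldest.encode("utf-8"))
        self._messages.append(text)
        self._used += size
        return True

    def flush(self):
        """Return and clear the buffered messages, oldest first."""
        dumped = list(self._messages)
        self._messages.clear()
        self._used = 0
        return dumped
```

A caller would invoke flush() at the point where an unexpected state report is detected and write the returned messages to the detector log file.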
215. heckCommands[0]
For technical reasons, spaces are displayed as tildes (~) within the wizard menu commands. The actual commands do not have tildes.
4.7 Specifying a display
Specify the display within the CommandLines menu as follows:
> Select Display by entering the number 5.
A list of display options appears (Figure 32).
 1) HELP
 2) RETURN
 3) FREECHOICE
 4) fuji1ADM
 5) fuji2ADM
 6) fuji3ADM
 7) fujiRCA
 8) fujiSCON
 9) fuji2
10) fuji3
11) fuji2RMS
12) fuji3RMS
Choose a display for this application: 3
>> 172.25.220.27
Figure 32: List of display options
You can choose from the list of detected hosts, or you can select 3) FREECHOICE to specify an arbitrary host with a suitable display.
> Select FREECHOICE by entering the number 3.
At the >> prompt, enter the host name or IP address for the X window display. In this example, we use the IP address 172.25.220.27, but you should enter an address in your LAN.
Completing the FREECHOICE step initiates another consistency check (Figure 33).
[Figure 33: CommandLines menu for Cmd_APP1 after the consistency check (consistent), with Display set to 172.25.220.27]
216. hed during a priority switchover, ordering a switchover after a Fault. The list is processed circularly.
The user specifies this attribute indirectly when selecting the nodes for an application. RMS uses the order in which the nodes were selected and creates PriorityList automatically. The user can change the PriorityList by adding individual nodes from the list in the desired order, rather than automatically selecting the entire list.
For applications controlled by a Follow controller, the order of nodes in PriorityList is ignored. However, each child application must be able to run on the nodes specified for the controller object.
- Scalable
Possible Values: 0, 1
Default: 0
Valid for controller objects. Specifies whether or not the object is a Scalable controller. The user changes this attribute indirectly by selecting the controller type in the configuration interface.
If set to 1, the object is a Scalable controller and Resource must contain the list of child applications. Other attributes must be set as follows (these are set automatically by the configuration tools):
IndependentSwitch: 1
AutoRecover: 0
IgnoreOfflineRequest: 0
IgnoreOnlineRequest: 0
Follow: 0
If Scalable is set to 1, then Follow must be set to 0. Scalable and Follow control policies are mutually exclusive.
- ScriptTimeout
Possible Values: 0 to MAXINT (in seconds)
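Circular processing of PriorityList means that the search for the next node wraps from the end of the list back to the beginning. A minimal sketch, with hypothetical names throughout; skipping nodes that are not Online is an assumption carried over from the rule that only Online SysNodes may be selected as switch targets.

```python
def next_priority_node(priority_list, current, online_nodes):
    """Return the next candidate after `current`, wrapping around the list.

    Returns None if no other node in the list is Online.
    """
    if current not in priority_list:
        raise ValueError(f"{current} is not in the priority list")
    start = priority_list.index(current)
    count = len(priority_list)
    # Visit every other position exactly once, starting just after `current`.
    for offset in range(1, count):
        candidate = priority_list[(start + offset) % count]
        if candidate in online_nodes:
            return candidate
    return None
```

With the list fuji2RMS, fuji3RMS and the application currently on fuji3RMS, the search wraps around and returns fuji2RMS if that node is Online.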
217. ic modification timeout.
The time taken for dynamic modification is greater than the timeout value. This timeout is equal to the value of the environment variable MODIFYTIMEOUTLIMIT if it is greater than 0, or equal to 0 if the value of the environment variable is less than or equal to 0. If the environment variable is not defined, the timeout value is 120 seconds by default.
Action: Contact field support.
- ADC 34: Dynamic modification timeout during start up, bm will exit.
The time taken for dynamic modification during bm startup is greater than the timeout value. The timeout is the value of the environment variable MODIFYTIMEOUTLIMIT if it is greater than 0, 0 if the value of the environment variable is less than or equal to 0, or 120 seconds by default if the environment variable is not defined. RMS then exits with exit code 63.
Action: Contact field support.
- ADC 35: Dynamic modification timeout, bm will exit.
Critical internal error.
Action: Contact field support.
- ADC 37 75: Dynamic modification failed: cannot make a non-critical resource <resource> critical by changing its attribute MonitorOnly to 0 since this resource is not online while it belongs to an online application <appname>; switch the application offline before making this resource critical.
During dyna
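The three-way rule for the timeout value can be written out directly. A minimal sketch, assuming the variable holds an integer; the function name is hypothetical, the behavior for non-numeric values is not described in this excerpt (here int() simply raises ValueError), and the excerpt gives a unit only for the 120 second default, so the variable's value is returned unchanged.

```python
import os


def dynamic_modification_timeout(environ=None):
    """Derive the timeout from MODIFYTIMEOUTLIMIT as described in the text."""
    if environ is None:
        environ = os.environ
    raw = environ.get("MODIFYTIMEOUTLIMIT")
    if raw is None:
        return 120                      # variable not defined: default of 120
    value = int(raw)                    # assumption: the value is an integer
    return value if value > 0 else 0    # zero or negative values give 0
```

Passing a dictionary instead of the real environment makes the rule easy to check for each of the three cases.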
218. ics
15.7 PCS
System administration
pcstool: Modifies PCS configurations from the command line
pcscui: Character-based interface for PCS
pcs_reinstall: Utility for re-integrating PCS with dependent products
maketrusted: Utility to install signed version of PCS
15.8 RCVM
RCVM is not available in all markets.
System administration
dkconfig: virtual disk configuration utility
dkmigrate: virtual disk migration utility
vdisk: virtual disk driver
dkmirror: mirror disk administrative utility
File format
dktab: virtual disk configuration file
15.9 Resource Database
To display a Resource Database manual page, add /etc/opt/FJSVcluster/man to the environment variable MANPATH.
System administration
clautoconfig: execute the automatic resource registration
clbackuprdb: save the resource database
clexec: execute the remote command
cldeldevice: delete resource registered by automatic resource registration
clinitreset: reset the resource database
clrestorerdb: restore the resource database
clsetparam: display and change the resource database operational environment
clsetup: set up the resource database
clstartrsc: resource activation
clstoprsc: resource deactivation
clsyncfile: distribute a file between cluster nodes
User command
clgettr
219. ied in the hvw command as follows: 1. Select the Configuration-Edit-Global-Settings menu. 2. Choose the DetectorDetails sub-menu. 3. Select MemoryLogLevel. When an unexpected Offline or Fault resource state occurs, the debugging messages are printed from the circular buffer into the detector log file. The information is intended to help determine why the unexpected status report occurred. Because the circular buffer stores earlier logging messages, the log file will contain several DEBUG statements dated earlier than the last item that was reported before the circular buffer was printed. The circular buffer is kept and printed so that, when a problem has occurred, the debugging statements it contains can be used to determine why the detector reported an unexpected resource state change. 7.9.2 Modifying levels while RMS is running. Debug reporting can be turned on or off dynamically within the RMS Wizard detectors by using the hvw command as follows: 1. Select Configuration-Edit-Global-Settings. 2. Choose the DetectorDetails sub-menu. 3. Select the DynamicDetectorLogging menu item. The default value is 0, which means that debugging is turned off. Setting the value to something greater than zero turns debugging on. The greater the value, the more debugging information is printed; however, the contents of
220. if RMS is running. In this case, RMS would need to be shut down first. The nodes that are currently not running RMS will have the persistent status information removed during the Configuration-Activate process. After the configuration has been activated successfully, you can return to the Main configuration menu. From there, you can quit the configuration procedure. > Press Enter to return to the Main configuration menu (see Figure 17). [Figure 17: Quitting the Main configuration menu. Screen text: "fuji2: Main configuration menu, current configuration: mydemo"; "No RMS active in the cluster"; 1) HELP, 2) QUIT, 3) Application-Create, 4) Application-Edit, 5) Application-Remove, 6) Application-Clone, 7) Configuration-Generate, 8) Configuration-Activate, 9) Configuration-Copy, 10) Configuration-Remove, 11) Configuration-Freeze, 12) Configuration-Thaw, 13) Configuration-Edit-Global-Settings, 14) Configuration-Consistency-Report, 15) Configuration-ScriptExecution, 16) RMS-CreateMachine, 17) RMS-RemoveMachine; prompt "Choose an action: 2"] > Select QUIT by entering the number 2. This ends the activation phase of the configuration process. Usually, the next step is to start RMS to monitor the newly configured application. > Start RMS with the GUI or with the following command: hvcm -a
221. ification is aborted. If the HostName attribute is missing, ADM 40 will take care of it. Action: Set the HostName attribute of resource <object> to the name of a valid SysNode. • ADM 46 12: Dynamic modification failed: linking the same resource <object> to different applications <appname1> and <appname2>. RMS received a directive to add a new child object <object> by linking it to parent objects belonging to different applications <appname1> and <appname2>. Dynamic modification is aborted. Action: When adding a new child resource, make sure that its parents do not belong to different applications. • ADM 47 23: Dynamic modification failed: parent object <parentobject> belongs to a deleted application. Any attempt to add a new node that has <parentobject> as its parent fails if <parentobject> is the child of an object that has been deleted, because deleting an object automatically causes its children to be deleted as well if they do not have any other parents. This causes dynamic modification to fail. Action: When adding a new object, make sure that its parent has not already been deleted. • ADM 48 24: Dynamic modification failed: child object <childobject> belongs to a deleted application. Any attempt to delete an object
222. iguration a second time on page 88. An abbreviated version of this example appears in the Installation Guide. To avoid network access problems, perform RMS configuration tasks as root and ensure that rhosts and the rcp and rsh services are configured as described in the Installation Guide. 4.1 Stopping RMS. Before you create or edit a configuration, ensure that RMS is not active on any machine that would be affected by the changes. You can use the Cluster Admin GUI (see the section "Stopping RMS" on page 130), or you can enter the following command to stop RMS on all nodes from any machine in the cluster: hvshut -a 4.2 Creating a configuration. > Enter the following command to generate the wizard menu for the configuration example mydemo: hvw -n mydemo This will create an RMS configuration file named mydemo.us in the /opt/SMAW/SMAWRrms directory. If you choose a different name and location, the combined length of the file name and path should not exceed 80 characters. The RMS configuration menu appears, displaying the name of the configuration at the top of the menu (Figure 18). fuji2: Main configuration menu, current configuration: mydemo. No RMS active in the cluster. 1) HELP 10) Configuration-Remove 2) QUIT 11) Configuration-Freeze 3) Application-Create 12) Configuration-Thaw 4) Application-Edit 13) Configuration-Edit-Gl
223. ild <childobject> first needs to be brought to the online state before linking it to the online parent <parentobject>. • ADM 20 51: Dynamic modification failed: cannot link child <childobject>, which is neither offline nor standby, to offline or standby parent <parentobject>. Any attempt to link two existing objects in which the child is in neither the Offline nor the Standby state and the parent is in the Offline or Standby state is prohibited and results in this message being written to the switchlog. Dynamic modification is aborted. Action: Bring the child to the Offline or Standby state before linking it to a parent that is in the Offline or Standby state. • ADM 21 44: Dynamic modification failed: cannot unlink parent <parentobject> and child <childobject> since they are not linked. Trying to unlink object <parentobject> from object <childobject> when they are not linked results in this message, and dynamic modification is aborted. Action: If you want to unlink two objects, make sure that they share a parent-child relationship. • ADM 22 46: Dynamic modification failed: child <childobject> will be unlinked but not linked back to any of the applications. Unlinking a child <childobject> so that no links remain between it and any userApplication is not allowed. Action: Make sure that the child is still linked to a userApplication. • ADM 23 47
224. ing: Multiple controllers simultaneously accessing a set of disk drives. native operating system: The part of an operating system that is always active and translates system calls into activities. network partition (CF): This condition exists when two or more nodes in a cluster cannot communicate over the interconnect; however, with applications still running, the nodes can continue to read and write to a shared device, compromising data integrity. node: A host which is a member of a cluster. A computer node is the same as a computer. node state (CF): Every node in a cluster maintains a local state for every other node in that cluster. The node state of every node in the cluster must be either UP, DOWN, or LEFTCLUSTER. See also UP (CF), DOWN (CF), LEFTCLUSTER (CF). object (RMS): In the configuration file or a system graph, this is a representation of a physical or virtual resource. See also leaf object (RMS), object definition (RMS), object type (RMS). object definition (RMS): An entry in the configuration file that identifies a resource to be monitored by RMS. Attributes included in the definition specify properties of the corresponding resource. The keyword associated with an object definition is object. See also attribute (RMS), object type (RMS). object type (RMS): A category of similar resources monitored as a group, such as disk drives. Each object type has specific prop
225. ion. Figure 74: Confirmation pop-up window. Figure 75: Confirmation pop-up window for scalable application. Figure 76: Viewing the RMS switchlog file. Figure 77: Viewing the RMS switchlog file in a detached window. Figure 78: Viewing the application log. Figure 79: Search based on date and time filter. Figure 80: Using the Find pop-up in log viewer. Figure 81: RMS full graph. Figure 82: RMS application graph. Figure 83: RMS subapplication graph. Figure 84: Composite subapplication graph. Figure 85: Configuration information pop-up. Figure 86: Command pop-up. Figure 87: RMS graph with affiliation names. Figure 88: RMS graph with resource names. Figure 89: RMS graph with affiliation names and resource names. Figure 90: RMS graph after RMS is shutdown. Figure 91: Clusterwide table. Figure 92: Faulted and offline applications in the clusterwide table. Figure 93: Exclamation marks in clusterwide table and the RMS tree. Figure 94: Command pop-ups in clusterwide table. Figure 95: Before RMS is shut down. Figure 96: After RMS is restarted with a differen
226. ion • CLI: The commands here employ the most commonly used options. For more details about any command, see the online manual pages, which are listed in the chapter "Appendix: List of manual pages" on page 349. The commands are located in the RELIANT_PATH/bin directory. All the RMS CLI commands accept both CF node names and RMS node names for SysNode objects when the RMS naming convention is followed, that is, the names are of the form nodenameRMS. 5.3.1 Starting RMS. 1. From the Cluster Admin rms tab, select Tools > Start RMS (Figure 97). [Figure 97: Starting RMS from the main menu] 2. The RMS Start Menu window opens. To start RMS on all nodes, click the all available nodes radio button and then click OK (Figure 98). [Figure 98: RMS Start Menu for all nodes] 3. To start RMS only on selected nodes, click the one node from the list radio button, select the desired node or nodes using the checkboxes in the Selectio
227. ion: Specify the kind for the generic detector in hvgdstartup. • BAS 24: ERROR IN CONFIGURATION FILE: The object <object> has an invalid rKind attribute. Objects of type gResource must have a valid rKind attribute. Object <object> has an invalid rKind attribute. Action: Make sure that the object <object> has a valid rKind attribute. • BAS 25: ERROR IN CONFIGURATION FILE: The object <object> has a ScriptTimeout value that is less than its detector report time. This will cause a script timeout error to be reported before the detector can report the state of the resource. Increase the ScriptTimeout value for <object> (currently <seconds> seconds) to be greater than the detector cycle time (currently <detectorcycletime> seconds). This message is the result of the ScriptTimeout value being less than the detector cycle time. This will cause the resource to appear faulted when being brought Online or Offline. Action: Make the value of ScriptTimeout greater than the detector report time. • BAS 26: ERROR IN CONFIGURATION FILE: The type of object <object> cannot be or and and at the same time. Each RMS object must be of a type derived from the or type or the and type, but not both. If this message appears in the switchlog, it indicates a severe corruption of the RMS executable. Action: Contact field support. • BAS 27: ERROR IN CONFIGURATION FILE
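The condition behind BAS 25 can be checked ahead of time with a few lines of code. The sketch below is an illustration only: the function name and the dictionary of per-object ScriptTimeout values are hypothetical, and it is an assumption that a ScriptTimeout equal to the detector cycle time should also be flagged (the message itself fires on "less than", while the action asks for "greater than").

```python
def find_short_script_timeouts(script_timeouts, detector_cycle_time):
    """Return the names of objects whose ScriptTimeout is not above the cycle time.

    script_timeouts maps an object name to its ScriptTimeout in seconds
    (a hypothetical in-memory view of the configuration).
    """
    return sorted(
        name
        for name, timeout in script_timeouts.items()
        # Assumption: equality is flagged too, since the action asks for
        # ScriptTimeout to be strictly greater than the detector time.
        if timeout <= detector_cycle_time
    )
```

For example, with a detector cycle time of 10 seconds, an object configured with a ScriptTimeout of 5 or 10 would be reported, while one configured with 300 would not.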
228. ion Definition Language (CDL) to provide a user-friendly configuration environment for products such as RMS. The standard templates can be modified or replaced to provide a customized interface for specific applications or installations. PRIMECLUSTER services (CF): Service modules that provide services and internal interfaces for clustered applications. private network addresses: Private network addresses are a reserved range of IP addresses specified by the Internet Assigned Numbers Authority. They may be used internally by any organization, but because different organizations can use the same addresses, they should never be made visible to the public internet. private resource (RMS): A resource accessible only by a single node and not accessible to other RMS nodes. See also resource (RMS), shared resource. public LAN: The local area network (LAN) by which normal users access a machine. See also Administrative LAN. queue: See message queue. redundancy: This is the capability of one object to assume the resource load of any other object in a cluster, and the capability of RAID hardware and/or RAID software to replicate data stored on secondary storage devices. Reliant Monitor Services (RMS): The package that maintains high availability of user-specified resources by providing monitoring and switchover capabilities. remote node: A node that is accessed through a LAN or telecommunications line
229. ion example. 4.10 Creating a second application. In this section, the mydemo configuration is expanded by adding a second application. Configuration steps that duplicate those of the first application are skipped to simplify the example; in other parts of the procedure, however, new features add to the complexity of the mydemo configuration. The second application differs from the first as follows: • The application uses a new application type, GENERIC, instead of DEMO. We will use the name APP2 for the second application. • APP2 will control the first application, APP1. Therefore, APP2 must be configured with a controller sub-application. Resume the configuration procedure as follows: > Stop RMS if it is running. > Return to the Main configuration menu with the following command: hvw -n mydemo The Main configuration menu opens (see Figure 46). [Figure 46 screen text: "fuji2: Main configuration menu, current configuration: mydemo"; "No RMS active in the cluster"; 1) HELP, 2) QUIT, 3) Application-Create, 4) Application-Edit, 5) Application-Remove, 6) Application-Clone, 7) Configuration-Generate, 8) Configuration-Activate, 9) Configuration-Copy, 10) Configuration-Remove, 11) Configuration-Freeze, 12) Configuration-Thaw, 13) Configuration-Edit-Global-Settings, 14) Configuration-Consistency-Report, 15) Configuration-ScriptExecution, 16) RMS-CreateMachine, 17) RMS-RemoveMachine; prompt "Choose an action"]
230. ion graph is similar to the full graph except that it shows just a single application and its resources. The graph is shown from the perspective of the selected node (Figure 82). [Figure 82: RMS application graph] 5.2.5.3 Subapplication graph. You can see a graph for a subapplication by right-clicking on a subapplication. The subapplication graph lists all the subapplications used by a given application, and it shows the connections between the subapplications (Figure 83). [Figure 83: RMS subapplication graph] 5.2.5.4 Composite subapplication graph. The composite subapplication graph is a variation of the subapplication graph. If an application depends on another application by means of a controller object, the composite subapplication graph can be used to show all the subapplications that the application depends on, directly or indirectly. For example, a composite graph may depict a Web Server application that depends on an Oracle Database Server application. The composite subapplication graph takes the controller object in a subapplication graph and appends the subapplication graph of the controlled a
231. ion to one of these standard types, you automatically take advantage of the built-in scripts. The hvexec command executes scripts for a high availability configuration monitored by RMS. For more details on the hvexec command, refer to the Primer document, which is described in the section "Further reading" on page 55. 3.6.2 Detectors. Detectors are processes that monitor resources. If the state of a resource changes (for example, of a disk group), the detector in charge notifies the RMS base monitor. The base monitor may then decide to have a script executed as a reaction to this changed state. Like the built-in scripts described in the previous section, the RMS Wizards provide built-in detectors for pre-defined application types. If you assign your application to one of these standard types, it automatically uses the built-in detectors. 3.6.3 RMS objects. A high availability configuration can be seen as a set or group of objects with interdependencies. Any application or resource that is part of the configuration is then represented by one of the objects. The interdependencies of objects can be displayed as a graph called the RMS graph. These are the most important object types used in RMS configurations: • userApplication: Represents an application to be configured for high availability. • SysNode
232. ious error was a failure to open the file needed for dynamic startup. The base monitor will exit. Action: Verify that the file exists and reissue the dynamic startup request. 9.2 ADM: Admin, command, and detector queues. • ADM 1: cannot open admin queue. RMS uses UNIX message queues for interprocess communication. The admin queue is one such queue, used for communication between utilities like hvutil, hvswitch, etc. If there is a problem opening this queue, this message is printed and RMS exits with exit code 3. Action: Contact field support. • ADM 2: RMS will not start up, errors in configuration file. When RMS is starting up, it performs dynamic modification under the hood. If it encounters errors in its configuration file during this phase, RMS exits with exit code 23. Action: Make sure there are no errors in the configuration file, based on the error messages printed prior to the above message in the switchlog. 9.3 BM: Base monitor. • BM 3: Usage: progname c config_file m h time level r count w time n. If RMS has not been invoked in the right way, because some arguments were either missing or used incorrectly, this message is printed to the switchlog, indicating the arguments. RMS exits with exit code 3. Action: Start RMS with the right arguments. • BM 49: Failure calculating configuration checksum. Duri
233. ious logging procedures. Specify logging with the -l (lowercase L) flag. To activate logging, shut down RMS and restart it with one of the log levels described in Table 9, or use the hvutil command to set the logging after RMS has been started. The log level specified with the -l option is a list of numbers or a range. Separate levels by means of commas or spaces in the list. If a space is used as a list separator, include the entire argument between the braces. A level range is defined as n1-n2. This includes all log levels from n1 up to and including n2. The -n2 range is the same as 1-n2. The n1- range defines all log levels above n1. The n1 value must be greater than or equal to 1. All log levels refer to internal functions of the base monitor and are only relevant for service personnel. In addition, executing RMS with several active log levels will affect system performance. If log level 0 is defined, all possible log levels are activated. Valid log levels are listed in Table 9. Table 9: Log levels. 0: Turn on all log levels. 1: Unused. 2: Turn on detector tracing. 3: Unused. 4: Turn on mskx tracing (stack tracing of the base monitor). 5: Error or warning message. 6: Heartbeats. 7: Base monitor level. 8: Detector error. 9: Administrative command message. 10: Basic-type level. 11: Dynamic reconfiguration contracting level.
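The list and range notation for log levels can be illustrated with a small parser. This is a sketch of the notation as described above, not the parser RMS itself uses: the function name `parse_log_levels` and the `max_level` parameter are hypothetical, the default upper bound of 11 is only the highest level visible in this excerpt of Table 9, and the open-ended range n1- is assumed here to start at n1 itself.

```python
import re


def parse_log_levels(spec, max_level=11):
    """Expand a log-level list such as "2,5-7" into a sorted list of integers."""
    levels = set()
    # Levels may be separated by commas or spaces.
    for token in re.split(r"[,\s]+", spec.strip()):
        if not token:
            continue
        if "-" in token:
            low_text, high_text = token.split("-", 1)
            low = int(low_text) if low_text else 1            # "-n2" means 1-n2
            high = int(high_text) if high_text else max_level  # "n1-" is open-ended (assumed inclusive)
            if low < 1 or high < low:
                raise ValueError(f"invalid range: {token!r}")
            levels.update(range(low, high + 1))
        else:
            levels.add(int(token))
    return sorted(levels)
```

Level 0 is passed through unchanged by this sketch; per the text it means that all log levels are activated, which a caller would have to handle separately.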
234. ipt immediately after the offline script. The offline process is as follows: 1. The userApplication changes to the Wait state. 2. The userApplication executes its pre-offline script and sends a corresponding request to its children after the pre-offline script terminates. 3. After receiving the pre-offline script, each child node changes to the Wait state, executes its pre-offline script, and forwards the request. 4. As soon as the leaf nodes have completed their pre-offline script, they send a corresponding message (confirmation of successful offline processing) to their parents. 5. The message is forwarded without any further activity from the children to the parent until it arrives at the userApplication. 6. After pre-offline processing has been completed, the userApplication executes its offline script, immediately followed by the post-offline script (the userApplication is a node without a detector). 7. The userApplication then generates the actual offline request. Processing of the offline request in the individual nodes is similar to online processing, as follows: • The offline script is executed first. • The post-offline script is started after the detector report Offline has arrived. • The request is forwarded to the children after the post-offline script has completed. As illustrated, the userApplication is the final node to go offline. After t
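Steps 1 to 5 of the offline process (the pre-offline phase) can be traced with a short recursive sketch. It is a simplified model for illustration, not RMS behavior in detail: the function name, the event labels, and the dictionary describing the object tree are all hypothetical, each object is assumed to have a single parent, and siblings are visited one after another rather than concurrently.

```python
def pre_offline_phase(node, children, events=None):
    """Record the order of pre-offline events for the tree rooted at node."""
    if events is None:
        events = []
    events.append((node, "wait"))         # the node changes to the Wait state
    events.append((node, "pre-offline"))  # then runs its pre-offline script
    kids = children.get(node, [])
    for child in kids:
        # The request is forwarded only after the script has terminated.
        pre_offline_phase(child, children, events)
    if kids:
        # Inner nodes pass the confirmation upward without further activity;
        # for the root this marks its arrival at the userApplication.
        events.append((node, "relay"))
    else:
        events.append((node, "confirm"))  # leaf nodes send the confirmation
    return events
```

Calling it on a chain such as `{"app": ["res"], "res": ["leaf"]}` shows the scripts running top-down and the confirmation travelling bottom-up.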
235. ir state. Two conclusions can be drawn from the above: • Leaf nodes without a detector are illegal in an RMS configuration, since they receive no detector report and cannot logically derive their state from the state of their children. Their state always remains Unknown. • All transitions from the Unknown state are always bottom-up, that is, from the leaf node to the userApplication. Every node above the leaf node first requires the state of its children before it is able to determine its own state. After the userApplication exits the Unknown state, the initializing process of the application ends. From this point, RMS controls the application. The initializing processes of userApplication nodes are independent of each other. Therefore, one application can be initialized while another application is still Unknown. The initializing process of SysNodes is also independent. A SysNode starts in the Unknown state and exits it after receiving the detector report. Thus, it does not wait for messages from its children (userApplication). This again illustrates the independence of the parent-child relationship between the userApplication and the SysNode. The Unknown state is a pure initial state. Once a node exits the Unknown state, it does not return to that state. 6.4 Online processing. The online processing for a userApplication is independent from t
236. is the result, with the detector exiting with exit code 106. Action: Contact field support. • NOD 10: detector Failed to open rep_queue. The detector hvdet_node uses the queue rep_queue to report the state of the other SysNodes in the cluster to the base monitor running on the same host as the detector. If there is a problem in sending the state over to the base monitor, this message is printed and the detector exits with exit code 112. Action: Contact field support. • NOD 11: service getservbyname returned NULL. If the detector has been unable to find the port at which the service <service> resides, this message is printed to the switchlog and the detector exits with exit code 126. Action: This is probably due to the absence of an entry for service <service> in /etc/services. • NOD 12: detector no NODE_SYS_0. The detector hvdet_node uses the queue NODE_SYS_0 to get the list of SysNodes from the base monitor running on the same host as the detector. The detector tries to create this queue until it succeeds or until it has made 10 attempts, whichever comes first. If it is still unsuccessful after these attempts, it prints the above message and exits with exit code 106. Action: Contact field support. • NOD 13: The RMS-CF-CIP mapping for SysNode <sysnode> to the CIP name has failed. Please verify that all entries in /etc/hosts and /etc/cip.cf are correct and that CF and CIP are fully configured.
237. istrator. Action: Verify the problem by using hvdisp -T SysNode to see the states of all SysNode objects. The hvdisp command does not require root privilege. If you verify that a SysNode is in a pending Wait state, call hvutil -o <SysNode> or hvutil -u <SysNode>. Caution: hvutil -u causes the surviving node to assume that the SysNode is actually dead, and it will invoke a failover immediately. If the node is still active, this may cause data corruption. Caution: hvutil -o causes the surviving node to assume that the SysNode was alive the entire time. It will therefore continue on the assumption that it is in sync with the remote SysNode. If this assumption is not true, it could cause unpredictable behavior and, in a worst-case scenario, data corruption. The detector cycle time can be changed from its default value by using the -w option of the hvcm command, as in hvcm -w <n> -c <config_file>, where n is the new detector cycle time. This value must be greater than the value of HV_CONNECT_TIMEOUT. • The RMS base monitor detects a loss of cluster heartbeat, but there is no indication as to the reason for the loss. RMS automatically invokes a tool that provides diagnostic information for this event. Action: The diagnostic tool performs the following actions: Invokes truss(1) on Solaris or strace(1) on Linux to trace the detector pr
238. iting an application. Turnkey mode (highly recommended): Turnkey mode is the default mode. This mode is highly recommended because it simplifies complicated tasks such as creating linkages between applications and sub-applications. Non-turnkey mode (only for expert users): Non-turnkey mode is meant for advanced or expert users only. If this mode is used, some rules must be followed; otherwise, the resulting configuration may remain in an inconsistent state and RMS will not start. Usage of this mode is not within the scope of this guide. • Application-Remove: Removes an existing application from the high availability configuration. • Application-Clone: Clones an application. This feature is provided for users who want to create a new application that differs only slightly from an existing one. To do this, clone an application and modify only the parts that are necessary to create a new one. • Configuration-Generate: Performs the following: Runs consistency checks on the configuration. Creates the RMS graph of the configuration and stores it in the configname.us file. The graph is a hierarchical description of objects that represent the nodes, applications, and resources used in the configuration. During the Configuration-Generate phase, the wizard indicates the progress with a series of dots on the screen
239. its controlled application, or change the controller's NullDetector attribute to 1. • ADM 87 67: Dynamic modification failed: only local attributes such as ScriptTimeout, DetectorStartScript, NullDetector, or MonitorOnly can be modified during local modification (hvmod -l). The reason for this message is that only the modification of local attributes is allowed during local modification. Action: Make a non-local modification, or modify different attributes. • ADM 88 68: Dynamic modification failed: attribute <attribute> is modified more than once for object <object>. This message appears because an attribute of a particular object can be modified only once in the same modification file, but <attribute> has been modified more than once for <object>. Action: Modify the attribute only once per object. • ADM 89 69: Dynamic modification failed: cannot rename existing object <sysnode> to <othersysnode> because either there is no object named <sysnode>, or another object with the name <othersysnode> already exists, or a new object with that name is being added, or the object is not a resource, or it is a SysNode, or it is a controlled application whose state will not be compatible with its controller. This message appears when we try to rename an existing object <sysnod
240. its with exit code of 2 or 6, depending on one of the following conditions: If an unknown option is used, the exit code is 2. If the hveject utility is invoked directly without any options or arguments, the exit code is 6. Action: Follow the expected usage for the utility. • Usage: hvjoin s host. An attempt to use the hvjoin utility in a way that does not conform to the expected usage leads to this message, and the utility exits with exit code 2 or 6, depending on one of the following conditions: If an unknown option is used, the exit code is 2. If the hvjoin utility is invoked directly without any options or arguments, the exit code is 6. Action: Follow the expected usage for the utility. • Usage: hvlogclean d. An attempt to use the hvlogclean utility in a way that does not conform to the expected usage leads to this message, and the utility exits with exit code 6. Action: Follow the expected usage for the utility. • Usage: hvmod i 1 f config_file us E L 1 1 c modification directives. If the hvmod utility is invoked in any one of the following ways, hvmod exits with exit code 6: If hvmod is invoked without any options. If hvmod is invoked with the -l or -i options but with arguments when none are expected. Action: Follow the expected usage for the utility. • Usage
241. [Figure 107: Switching a busy application. The dialog reports: "The application is currently busy/locked, try again later."] CLI: Refer to the section "Starting an application" on page 134 for information on this command. 5.3.5 Taking an application offline. Shut down an online application as follows: > Right-click on the application object and select the Offline option from the pop-up menu (Figure 108).
242. l important RMS actions (for example, incoming switch requests or faults that occur in nodes). There are also configuration-specific log files in the log directory. It is recommended that administrators evaluate these if necessary. The names of these log files depend on the configuration that was set up using the configuration wizards (RMS Wizard Tools or PCS). Consult the RMS Wizard Tools or PCS online documentation for further information. The following log files can also be used for problem solving: • hvdet_nodelog • bmlog 7.3 Log files. Table 7 identifies and explains the RMS log files contained in /var/opt/SMAWRrms/log. Table 7: Log files (Module, File name): base monitor, abortstartlog. Records all messages between objects and all modification instructions. The default is off. This file contains records about bm exit conditions to assist support personnel in determining why RMS failed to start. This file is generated if the following message appears during startup: FATAL ERROR RMS has failed to start. General RMS error and message logging information ranges from simple message reporting to more complete information. The error log level, which is specified when the base monitor is started, determines the contents of this file. Refer to the section "Specifying the log level" on page 182 for more information. Includes all messages
243. l01 fuji2rmsAl02 fuji3rmsAl02
Figure 42: Global settings main menu with AlternateIps for both hosts
> Select 3 (SAVE EXIT) to save the updated information and return to the Main configuration menu.
4.9 Activating the configuration
As described in the section "General configuration procedure" on page 40, activating a configuration is the third of the four fundamental steps required to set up a high availability configuration. You must stop RMS before activating a configuration. In this example, we stopped RMS before creating the configuration.
The starting point for the activation phase is the Main configuration menu (Figure 43).
fuji2: Main configuration menu, current configuration: mydemo
No RMS active in the cluster
 1) HELP                      10) Configuration Remove
 2) QUIT                      11) Configuration Freeze
 3) Application Create        12) Configuration Thaw
 4) Application Edit          13) Configuration Edit Global Settings
 5) Application Remove        14) Configuration Consistency Report
 6) Application Clone         15) Configuration ScriptExecution
 7) Configuration Generate    16) RMS CreateMachine
 8) Configuration Activate    17) RMS RemoveMachine
 9) Configuration Copy
Choose an action: 8
Figure 43: Main configuration menu
> Select Configuration Activate by entering the number 8.
No further input is required at this stage. As the Wizard completes each task in the
244. le Services
GLS: See Global Link Services.
Global Disk Services: This optional product provides volume management that improves the availability and manageability of information stored on the disk unit of the Storage Area Network (SAN).
Global File Services: This optional product provides direct, simultaneous accessing of the file system on the shared storage unit from two or more nodes within a cluster.
Global Link Services: This PRIMECLUSTER optional module provides network high availability solutions by multiplying a network route.
generic type (RMS): An object type which has generic properties. A generic type is used to customize RMS for monitoring resources that cannot be assigned to one of the supplied object types. See also object type (RMS).
graph (RMS): See system graph (RMS).
graphical user interface: A computer interface with windows, icons, toolbars, and pull-down menus that is designed to be simpler to use than the command-line interface.
GUI: See graphical user interface.
high availability: A system design philosophy in which redundant resources are employed to avoid single points of failure. See also Reliant Monitor Services (RMS).
interconnect (CF): See cluster interconnect (CF).
Internet Protocol address: A numeric address that can be assigned to computers or applications. See also IP aliasing.
Internode Communic
245. levant operating system.
12 Appendix: Object types
Table 10 contains a list of all object types that are supplied with RMS. The middle column lists the attributes that must be specified, or are recommended, for the object type in the object configuration file definition.
Type | Required attributes | Description
andOp | HostName (for direct children of a userApplication object) | Object that is associated with its children by a logical AND operator. Define this type of object if all children have to be online or offline at the same time.
controller | Resource; either Follow or Scalable | Object within a userApplication that controls one or more userApplication objects.
gResource | rKind, rName | Custom (generic) object.
ENV | None required | Object containing clusterwide (global) environment variables.
ENVL | None required | Object containing node-specific (local) environment variables.
orOp | None required | Object associated with its children by a logical OR operator. Define this type of object if at least one child has to be online at all times.
SysNode | None required | Node object; required. Only type userApplication can be defined for the children.
userApplication | None required | User application; required for every application. Only SysNode parents are allowed. The attribute HostName must
246. lid for controller objects. Specifies the time in seconds allowed for a controller not to react while a child application leaves the Online state.
• PersistentFault
Possible values: 0, 1. Default: 0. Valid for userApplication objects.
If set to 1, the application maintains a Faulted state across an RMS shutdown and restart. The application returns to the Faulted state if it was Faulted before, unless the fault is explicitly cleared by either hvutil -c or hvswitch -f, or RMS is restarted with the Faulted SysNode removed from the configuration.
• PreCheckScript
Possible values: Valid script (character). Default: empty. Valid for userApplication objects.
Specifies the script to be forked as the first action during Online or Standby processing. If the script returns with a zero exit code, processing proceeds. If the script returns with an exit code other than zero, processing is not performed and an appropriate warning is logged to the switchlog file.
• Resource
Possible values: Valid name (character). Default: empty. Valid for controller objects.
One or more names of child applications, separated by spaces and/or tabs.
• rKind
Possible values: 0 to 2047. Default: none. Valid for gResource objects.
Specifies the kind of detector for the object.
• rName
Possible values: Valid string (character). Defa
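The PreCheckScript rule in entry 246 above (a zero exit code lets Online or Standby processing proceed; any other exit code stops it and logs a warning) can be sketched in Python. This is only an illustration of the described behavior, not RMS code: the function name run_precheck and the log callback are hypothetical, and the real base monitor writes its warning to the switchlog.

```python
import subprocess
import sys


def run_precheck(script_argv, log=print):
    """Run a pre-check command; return True only if it exits with code 0.

    Hypothetical helper that models the documented PreCheckScript rule.
    It is not part of RMS.
    """
    result = subprocess.run(script_argv)
    if result.returncode == 0:
        return True  # zero exit code: processing may proceed
    # Non-zero exit code: processing is not performed and a warning is logged.
    log(f"WARNING: pre-check exited with code {result.returncode}; request not processed")
    return False


if __name__ == "__main__":
    # Stand-in scripts: Python one-liners that exit with a chosen code.
    print(run_precheck([sys.executable, "-c", "raise SystemExit(0)"]))
    print(run_precheck([sys.executable, "-c", "raise SystemExit(2)"]))
```

In a real configuration the script itself would be whatever the wizard generated (the screenshots in this manual show an hvexec call); the sketch only shows how the exit code gates the request.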
247. line is empty or has some incorrect value.
Action: Contact field support.
• DET 12: DETECTOR STARTUP FAILED <detector> REASON errorreason
If the detector <detector> could not be started due to errorreason, this message is the result. The reason errorreason could be any one of the following:
- The detector <detector> does not exist.
- The detector <detector> does not have execute permission.
- The process for the detector could not be spawned.
- The number of processes created by the base monitor at the same time is greater than 128.
Action: Depending on the reason for the error, take appropriate action.
• DET 13: Failed to execute script <script>
The detector script is not good or the format is not good.
Action: Check the detector startup script.
• DET 24: FAULT REASON Resource <resource> transitioned to a Faulted state due to the resource failing to come Standby after running its OnlineScript onlinescript
After a resource executes its online script during a standby request, it is expected to come Standby. If it does not change its state, or transitions to a state other than Standby or Online, within the number of seconds specified by its ScriptTimeout attribute, the resource is considered to be Faulted.
Action: Make sure the Online script moves the resource into Standby or Online state during standb
248. lity is left to the system administrator. If RMS detects a severe time discrepancy between the nodes in the cluster, an ERROR message is printed to the switchlog. NTPD may be used to establish consistent time across the nodes in the cluster. Refer to the manual page for xntpd for more information.
The OnlinePriority persistent state information will be cleared if RMS is restarted with the last Online node removed from the configuration.
• OnlineScript
Possible values: Valid script (character). Default: empty. Valid for all objects except SysNode objects.
Specifies the script to bring the associated resource to the Online state.
• PartialCluster
Possible values: 0, 1. Default: 0. Valid for userApplication objects.
Specifies if an application can negotiate online requests. If set to 0, the application can negotiate its online request only when all nodes where it can possibly run are online. If set to 1, the application can negotiate its online request within the currently online nodes, even if some other nodes (including the application's primary node) are offline or faulted.
For an application that contains a Scalable controller (that is, for a parent application), PartialCluster must be set to 1. Each child application must have its attributes set as follows:
- PartialCluster must be set to 0.
- AutoStartUp must be set to 0.
• PostO
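The PartialCluster rule in entry 248 above can be read as a simple predicate over node sets. The following Python sketch is a hedged model of that rule only: the function name is hypothetical, node names are plain strings, and the requirement that at least one candidate node be online when PartialCluster is 1 is an assumption, since the text does not state it.

```python
def can_negotiate_online(partial_cluster, candidate_nodes, online_nodes):
    """Return True if the application may negotiate its online request.

    Hypothetical model of the PartialCluster attribute, not RMS code.
    candidate_nodes: nodes where the application can possibly run.
    online_nodes: nodes that are currently online.
    """
    candidates = set(candidate_nodes)
    online = candidates & set(online_nodes)
    if partial_cluster == 0:
        # Every node where the application can possibly run must be online.
        return online == candidates
    if partial_cluster == 1:
        # Negotiate among the currently online nodes (assumed: at least one).
        return bool(online)
    raise ValueError("PartialCluster must be 0 or 1")


if __name__ == "__main__":
    nodes = ["fuji2RMS", "fuji3RMS"]
    print(can_negotiate_online(0, nodes, ["fuji2RMS"]))
    print(can_negotiate_online(1, nodes, ["fuji2RMS"]))
```

With one of the two example nodes offline, the first call models an application that has to wait and the second models one that can proceed on the remaining node.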
249. ll bring the resource to the offline state.
5.3.8 Clearing a SysNode Wait state
Clear any node in the Wait state as follows:
> Right-click on the node and select the Online or Offline option from the pop-up menu.
The clearing of the Wait state for a node will be ignored unless the Shutdown Facility (SF) timeout has been exceeded.
CLI: The syntax for the CLI is as follows:
hvutil -o SysNode
This command clears the Wait state for the specified SysNode on all cluster nodes after the SF failed to kill the cluster node SysNode, by returning the specified SysNode to the online state. If the SysNode is currently in the Wait state, and the last detector report for the SysNode is in the online state, the Wait state is cleared and the SysNode goes back to the online state as if no kill request had ever been sent.
Caution: Manually clearing the SysNode Wait state by using either hvutil -o SysNode, cftool -k, or the GUI causes RMS, CF, and SF to believe that the node in question has been confirmed to be down. Doing so without the node really being down can lead to data corruption.
5.3.9 Displaying environment variables
Display the global (clusterwide) environment variables as follows:
> Right-click on a cluster in the RMS tree window and select View Environment (Figure 110).
250. ller objects. If set to 1, then neither PreOffline nor Offline requests will be propagated to child applications. If 0, then requests will be propagated. Must be 1 for a Follow controller. Must be 0 for a Scalable controller.
• IgnoreOnlineRequest
Possible values: 0, 1. Default: 1. Valid for controller objects.
If set to 1, then neither PreOnline nor Online requests will be propagated to child applications. If 0, then Online requests will be propagated. Must be 1 for a Follow controller. Must be 0 for a Scalable controller.
• IgnoreStandbyRequest
Possible values: 0, 1. Default: 1. Valid for controller objects.
If set to 1, then neither PreOnline nor Online requests during standby processing will be propagated to the child application. If 0, then requests will be propagated. If the controller is not standby capable, then IgnoreStandbyRequest must be set to 1.
• IndependentSwitch
Possible values: 0, 1. Default: 0. Valid for controller objects.
Determines the action of controlled child applications when the controlling parent application is switched from one node to another. If 0, the parent controller propagates the switch request to each child application. A child application may not be switched on its own. If 1, the parent application can be switched to another node without causing a similar switchover for the child
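The "must be" constraints in entry 250 above lend themselves to a small consistency check. The sketch below covers only the two named attributes whose rules are stated in full (IgnoreOnlineRequest and IgnoreStandbyRequest). The function and its parameters are hypothetical; RMS and the configuration wizards perform their own validation, which is not shown in this section.

```python
def check_controller(mode, ignore_online_request, ignore_standby_request,
                     standby_capable):
    """Return a list of rule violations for one controller object.

    Hypothetical validator, not RMS code.
    mode: "Follow" or "Scalable". Flags are 0 or 1; standby_capable is a bool.
    """
    required = {"Follow": 1, "Scalable": 0}
    if mode not in required:
        raise ValueError("mode must be 'Follow' or 'Scalable'")
    problems = []
    # IgnoreOnlineRequest must be 1 for Follow and 0 for Scalable controllers.
    if ignore_online_request != required[mode]:
        problems.append(
            f"IgnoreOnlineRequest must be {required[mode]} for a {mode} controller")
    # A controller that is not standby capable must ignore standby requests.
    if not standby_capable and ignore_standby_request != 1:
        problems.append(
            "IgnoreStandbyRequest must be 1 when the controller is not standby capable")
    return problems


if __name__ == "__main__":
    print(check_controller("Follow", 1, 1, standby_capable=False))
    print(check_controller("Scalable", 1, 0, standby_capable=False))
```

An empty list means the combination satisfies the rules quoted above; each string in a non-empty list names one violated rule.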
251. llows:
> Right-click on the application object and select the Online option from the pop-up menu (Figure 105).
[Figure 105: Starting an application]
CLI: The syntax for the CLI is as follows:
hvswitch [-f] userApplication [SysNode]
The hvswitch command manu
252. logs may vary from detector to detector. The valid range of values is 1 to 9; 0 means logging is turned off. Any modification to this setting takes effect the next time the configuration is activated.
The command actually creates the file <RELIANT_LOG_PATH>/etc/wizardloglevel, with its contents being the numerical value of the desired debug level. A value of zero in the file turns debugging off.
Alternatively, you can create the wizardloglevel file manually. If the file exists, a default debugging level of 3 is used. The debug level can be modified by inserting a numerical value in the file.
It is important to realize that turning on debugging in this manner affects all detectors, which will all print out the additional debugging information.
Debugging levels should only be turned on in this manner when problems occur, and only for debugging purposes. Once any problems are resolved, debugging should be turned off again so that the file system does not fill up with extraneous information.
7.10 PCS log files
The PCS log file is /var/opt/SMAW/log/pcs.log. The PCS log file may be useful to service personnel if an internal program error occurs.
If the Trace Option is set, several trace files are created in the /var/opt/SMAW/SMAWpcs/trace directory. Trace files may help service personnel diagnose internal pro
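The wizardloglevel rules in entry 252 above can be summarized as a small lookup. The Python sketch below is one reading of that text under stated assumptions: a missing file is treated as debugging off, and a file without a numeric value is treated as the default level 3 (the text only says that the default applies "if the file exists"). The function name is hypothetical, and the caller supplies the file path.

```python
from pathlib import Path


def wizard_debug_level(path):
    """Return the effective detector debug level for a wizardloglevel file.

    Hypothetical helper, not part of RMS.
    Assumptions: no file means level 0 (off); an empty file means level 3.
    """
    p = Path(path)
    if not p.exists():
        return 0  # assumption: without the file, debugging is off
    text = p.read_text().strip()
    if not text:
        return 3  # file exists but holds no value: documented default
    level = int(text)
    if not 0 <= level <= 9:
        raise ValueError("debug level must be between 0 and 9")
    return level
```

A value of 0 in the file and a missing file both yield 0 here, which matches the statement that a value of zero turns debugging off.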
253. <childobject> belonging to a deleted application elicits this response from RMS, because deleting an application automatically causes all its children to be deleted as well.
Action: Do not try to delete an object belonging to an already deleted application.
• ADM 49 24: Dynamic modification failed: deleted object <objectname> belongs to a deleted application
Any attempt to delete an object <objectname> that belongs to a deleted application leads to this error, because deleting an application deletes all its children, including <objectname>.
Action: Make sure that, before an object is deleted, it does not belong to an application that is being deleted.
• ADM 50 40: Dynamic modification failed: cannot delete object <object> since it is a descendant of a new object
When RMS gets a directive to delete an object <object> which is a descendant of a new object, dynamic modification aborts and this message is the result.
Action: Make sure that when an object is being deleted, it is not a descendant of a new object.
• ADM 51 15: Dynamic modification failed: cannot link to child <childobject> since it will be deleted
When RMS gets a directive to link to a child <childobject> that is going to be deleted, dynamic modification aborts.
Action: Do not link to a child object which is to be deleted.
254. ly caused by manual editing or distribution of the configuration file.
Action: Use only PCS or the Wizard Tools to create and activate your configuration. If you have used only the standard tools and this error persists, contact field support.
• US 42: A State transition error occured. See the next message for details.
A state transition error occurred in the course of RMS state transitions. Details of the error are printed in the subsequent lines.
Action: Save the error description and contact field support.
9.15 WLT: Wait list
• WLT 9: sdtool notification timed out after <timeout> seconds
After dynamic modification, the Shutdown Facility is notified via sdtool about the changes in the current configuration. If this notification does not finish within the period specified by the local SysNode ScriptTimeout value, the base monitor must exit.
Action: Verify that sdtool and the Shutdown Facility are operating properly. Increase the ScriptTimeout value if needed.
9.16 WRP: Wrappers
• WRP 40: The length of the type name specified for the host <host> is <length> which is greater than the maximum allowable length <maxlength>. RMS will exit now.
The length of the interconnect name is greater than the maximum value.
Action: Make sure that the interconnect name is shorter than the maximum length <maxlength>.
255. me
• BM 24 81: Dynamic modification failed: some resource(s) supposed to come standby failed
During dynamic modification, when new resource(s) that are to be added to a resource that is Standby cannot be brought Standby, this message is the result.
Action: Analyze your configuration to make sure that standby-capable resources can get to the standby state.
• BM 25 82: Dynamic modification failed: standby capable controller <controller> cannot control application <appname> which has no standby capable resources on host <sysnode>
In order for an application <appname> to be controlled by a controller <controller>, the application <appname> has to have at least one standby-capable resource on host <sysnode>.
Action: Make sure that the controlled application has at least one standby-capable controller, or make sure that the controllers are not standby capable.
• BM 26 83: Dynamic modification failed: controller <controller> cannot have attributes StandbyCapable and IgnoreStandbyRequest both set to 0
This message appears when user sets both controller attributes StandbyCapable and IgnoreStandbyRequest to 1.
Action: Make sure that only one is set to 1 and other to 0.
• BM 29 84: Dynamic modification failed: controller object <controller> cannot have its attribute Follow set to 1 while o
256. me fuji N A fuji2 fuji3 Device dev hmel N A dev hme3 dev hme3 IP Address 172 25 219 161 N A 172 25 219 83 172 25 219 84 Netmask 255 255 255 0 N A 255 255 255 0 255 255 255 0 Cluster Interconnect Device Name 1 N A N A dev hmel dev hmel Device Name 2 N A N A dev hme2 dev hme2 Device Name 3 N A N A dev ip0 dev ip0 Cluster IP Name N A N A fuji 2RMS fuji 3RMS Address N A N A 92 168 1 1 92 168 1 2 Administrative LAN Name fujiSCON fujiRCA Fuji2ADM fuji3ADM Device dev hme0D N A dev hme0 dev hme0 IP Address 172 25 200 1 172 25 200 2 72 25 200 4 72 25 200 5 Netmask 255 255 255 0 255 255 255 0 255 255 255 0 255 255 255 0 Table 5 Cluster site planning worksheet This example assumes etc hosts contains the following entries which follow the RMS naming convention host names for RMS 192 192 192 192 168 192 168 192 168 168 168 168 1 1 LZ Led 1 21 1 12 122 fuji2RMS fuji3RMS fuji2rmsAl01 fuji2rmsAl02 fuji3rmsAl01 fuji3rmsAl02 Se RR SE alternate for alternate for alternate for alternate for fuji2 fuji2 fuji3 fuji3 U42117 J Z100 4 76 59 Adding hosts to the cluster Configuration example In this step you will add all of these hosts to the cluster gt Atthe Main configuration menu enter the number 16 The Add hosts to a cluster menu appears Figure 19 Creation Add hosts to a cluster Current set 1 HELP 2 QUIT 3 RETURN 4 FREECHOICE
257. messages
• NOD 38: cluster host <host> is no longer in time sync with local node. Sane operation of RMS can no longer be guaranteed.
The time on the cluster host <host> differs significantly (more than 5 times the hvdet_node interval) from the local node.
Action: Make sure that all the cluster hosts are in time sync.
• NOD 40: command gethostbyname returned NULL for host hostname
If there is a problem in the detector when resolving a host <hostname>, this message is the result and the detector exits with exit code 114.
Action: Make sure that you provide a valid host name.
8.14 QUE: Message queues
• QUE 13: RCP fail: filename is being copied
If there is an attempt to copy the file with name filename when there is another copy in progress, this message is the result.
Action: Make sure that concurrent copies of the same file do not occur.
• QUE 14: RCP fail: fwrite errno errno
There was a problem while transferring files from one cluster host to the other.
Action: Take action based on the errno.
8.15 SCR: Scripts
• SCR 8: Invalid script termination for controller <controller>
The controller script is not correct or invalid.
Action: Check the controller script.
• SCR 9: REASON failed to execute script <script> with resource <resource> errorreason
The detecto
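The NOD 38 condition in entry 257 above (host time differs by more than 5 times the hvdet_node interval) amounts to a one-line comparison. The sketch below only illustrates that arithmetic; the function name and the choice of seconds as the unit are assumptions, and it does not show how RMS actually obtains the remote time.

```python
def out_of_time_sync(local_time, remote_time, detector_interval):
    """Return True if two clocks differ by more than 5 detector intervals.

    Hypothetical helper, not RMS code. All arguments are in seconds.
    """
    return abs(local_time - remote_time) > 5 * detector_interval


if __name__ == "__main__":
    # With a 10-second interval the tolerance is 50 seconds.
    print(out_of_time_sync(1000.0, 1060.0, 10))
    print(out_of_time_sync(1000.0, 1050.0, 10))
```

A difference exactly equal to 5 intervals is not reported, because the text describes the threshold as "greater than".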
258. mic modification, if there is an attempt to make a non-critical resource <resource> MonitorOnly while it is not online and the application <appname> is Online, this message is the result, along with dynamic modification aborting.
Action: Switch the userApplication Offline before making the resource critical.
• ADC 38 76: Dynamic modification failed: application <appname> has no children, or its children are not valid resources
If RMS finds that the userApplication <appname> will have no children while performing dynamic modification, this message is printed out to the switchlog and dynamic modification is aborted.
Action: Make sure that the userApplication has valid children while performing dynamic modification.
• ADC 39: The putenv has failed (failurereason)
The wizards use the environment variable HVMOD_HOST during dynamic modification. This variable holds the name of the host on which hvmod has been invoked. If this variable cannot be set with the function putenv, then this message is printed to the switchlog along with the reason failurereason.
Action: Check the reason failurereason in the switchlog to find out why this operation has failed, and take corrective action based on this.
• ADC 41: The Wizard action failed (command)
Wizards make use of an action file during hvmod. If the execution of this action fil
259. n. In this case, each of the display windows closes and a new display at the same position is displayed. Figure 95 illustrates the display containing AppA and AppB before RMS is shut down, and Figure 96 shows the RMS GUI after RMS has been restarted with a different configuration that uses app1 and app2.
[Figure 95: Before RMS is shut down]
[Figure 96: After RMS is restarted with a different configuration]
5.3 RMS procedures
Each of the following sections presents two alternative methods:
• GUI: The Cluster Admin interface is the preferred method of operat
260. [Figure 63: Return to Main configuration menu]
> Select QUIT by entering the number 2.
This ends the activation phase of the configuration process.
4.14 Starting RMS
At this point, you are ready to start RMS to monitor both applications. You can use the Cluster Admin GUI (see the section "Starting RMS" on page 126), or you can enter the following command from any machine in the cluster:
hvcm -a mydemo
This ends the configuration example.
5 Administration
This chapter describes PRIMECLUSTER administration using the Cluster Admin graphical user interface (GUI). In addition, some command-line interface (CLI) commands are discussed.
This chapter discusses the following:
• The section "Overview" on page 91 introduces PRIMECLUSTER administration by means of the Cluster Admin and the CLI.
• The section "Using Cluster Admin" on page 91 discusses how to use the RMS portion of the GUI.
• The section "RMS procedures" on page 125 describes how to administer RMS using the GUI. It also contains CLI commands as a convenience for advanced users.
5.1 Overview
RMS administration can be done by means of the Cluster Admin GUI or by the CLI; however, it is recommended that you use the Cluster Admin GUI. The CLI should only be used by expert
261. n column and then click OK Figure 99 EARMS Start Menu x Notes This dialog will allow you to start RMS on one or more available nodes Nodes shown are ina state in which they can be started Start RMS all available nodes one node from the list Selection Ok Cancel fi ava Applet Window Figure 99 RMS Start Menu for individual nodes 128 U42117 J Z100 4 76 Administration RMS procedures Alternatively you can start RMS on individual nodes directly from the Cluster Admin window 1 In the left pane click the rms tab to view the cluster tree 2 Right click on the node and select StartRMS from the pop up menu Figure 100 EA Cluster Admin 01 Cluster Admin File Tools Preferences Help Bru Attributes MO wirms fuji2RMS System Node amp I app2 RMS Attribute Value cmComm hvem c mydemo De appt MonitorOnly 0 7 1 une e JO alSrtrms o IO a View switchlog PRIMECLUSTER Configuration Services PCS iptreliantbin tools d hvalert ANY ERROR Sysnode fuji2RMS faulted oniine Owsit Offline ODeact Ss Unknown inconsistent stand By Warning ottlineraut Ter rmsapes Tsis ms Java Applet Window Figure 100 Starting RMS on individual nodes CLI The syntax for the CLI is as follows hvem E c config_file a s SysNode The hvcm
262. n depends on, directly or indirectly.
You can use any graph for access to the following features:
• Configuration information from a graph
• Command pop-ups
• RMS graph customization
• Node status after RMS is shut down
These graphs and their features are explained in more detail in the sections that follow.
5.2.5.1 RMS full graph
The RMS full graph displays the complete configuration of the cluster (Figure 81). The graph represents the following items in the RMS configuration:
• Relationships between objects
• Dependencies of objects
• Object types
• Current node state
You can see the RMS full graph by right-clicking on a system node. The RMS graph is drawn from the perspective of a particular system node; that is, the state information of all the nodes is displayed as viewed from a particular system node. You can view an RMS graph from the perspective of any of the system nodes. The node name in the title bar of the graph identifies the node that is supplying the state information.
[Figure 81: RMS full graph]
5.2.5.2 Application graph
You can see a graph for a single application by right-clicking on an application. The application graph shows all the resources used by that specific application. You can also look at specific resource properties. The applicat
263. n item in an object's pop-up menu that can cause state changes to that object, a confirmation pop-up window appears (Figure 74). To proceed with the action described in the warning message, click Yes; to cancel the action, click No.
[Figure 74: Confirmation pop-up window]
For a scalable userApplication object, the confirmation pop-up lists the controlled applications and warns that their states can also change with the specified action (Figure 75).
[Figure 75: Confirmation pop-up window for scalable application]
5.2.4.5 Switchlogs and application logs
The switchlog on individual system nodes can be viewed by using the View Switchlog option from the system node command pop-up window. The switchlog is displayed in a tab on the right-side panel (Figure 76).
264. n this manner. All log files which are older than the number of days specified in this variable are deleted by a cron job.
• RELIANT_LOG_PATH
Possible values: Any valid path. Default: /var/opt/SMAWRrms/log
Specifies the directory where all RMS and RMS wizard log files are stored.
• RELIANT_PATH
Possible values: Any valid path. Default: /opt/SMAW/SMAWRrms
Specifies the root directory of the RMS directory hierarchy. Users do not normally need to change the default setting.
• RELIANT_SHUT_MIN_WAIT
Possible values: 0 to MAXINT. Default: 150 (seconds)
Defines the period, in seconds, that the command hvshut waits before timing out and generating an error message. This variable must be set to the maximum value required to successfully terminate offline processing for a specific application. This value corresponds to the maximum time required by an application to go offline on all cluster nodes if the hvshut -a option is used.
If this value is too low, the hvshut command will time out and generate an error message. However, this does not mean that the shutdown process is stopped; only the hvshut command itself times out. The shutdown process continues within the RMS system, so the system shuts down successfully even though the command has already exited.
14.2 Local environment
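The log retention rule at the start of entry 264 above (log files older than the configured number of days are deleted by a cron job; the index lists the corresponding variable as RELIANT_LOG_LIFE) can be illustrated with a pure selection function. This is a sketch under assumptions: it judges age by modification time, which the text does not specify, it only selects names rather than deleting anything, and the function name and file names are hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def expired_logs(files, log_life_days, now):
    """Return the names of log files older than log_life_days.

    Hypothetical helper, not the actual RMS cron job.
    files: iterable of (name, mtime) pairs, mtime in seconds since the epoch.
    now: current time in seconds since the epoch.
    """
    cutoff = now - log_life_days * SECONDS_PER_DAY
    # Keep the input order; a file is expired if it was last modified
    # before the cutoff.
    return [name for name, mtime in files if mtime < cutoff]


if __name__ == "__main__":
    now = 10 * SECONDS_PER_DAY
    files = [("recent.log", 9.5 * SECONDS_PER_DAY), ("old.log", 2 * SECONDS_PER_DAY)]
    print(expired_logs(files, 7, now))
```

Passing the timestamps in explicitly keeps the function deterministic, so the rule can be checked without touching a real log directory.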
265. nds clmtest 351 mipcstat 351 pcs_reinstall 351 pcscui 351 pcstool 351 PersistentFault attribute 339 physical disks state at initialization 156 PostOfflineScript attribute 330 script 23 PostOnlineScript attribute 330 script 23 PreCheckScript attribute 339 script 22 PreOffline processing 158 PreOfflineScript attribute 330 PreOnlineScript attribute 331 script 23 PreserveState effect on fault processing 160 161 PreserveState attribute 331 U42117 J Z100 4 76 393 Index primary management server 92 PRIMECLUSTER 9 priority switch request 165 PriorityList attribute 331 privileges 93 procedures 125 R rcfs_fumount 350 rcfs_list 350 rcfs_switch 350 rcsd 355 rcsd cfg 355 Revm resource wizard 33 Reliant Monitor Services clusterwide table 119 components 20 full graph 108 graphs 108 high availability 10 main window 97 overview 10 tree 97 RELIANT_HOSTNAME 347 RELIANT_INITSCRIPT 348 RELIANT_LOG_LIFE 344 RELIANT_LOG_PATH 171 344 RELIANT_PATH 344 RELIANT_SHUT_MIN_WAIT 345 RELIANT_STARTUP_PATH 348 request 149 blocking 167 offline 157 request triggered scripts InitScript 22 OfflineScript 23 OnlineScript 23 PreCheckScript 22 PreOnlineScript 23 resource wizards Gds 33 Gls 33 Ipaddress 33 Revm 33 Vxvm 33 Resource attribute 339 resources clearing faulted 25 configuring 33 defining 16 dependant 11 executing scripts 45 file system entries 36 38 LAN interfaces 36 monit
266. ne of OnlineTimeout or StandbyTimeout is not null
The controller node <controller> should have one of its attributes OnlineTimeout or StandbyTimeout be null to allow the attribute Follow to be 1.
Action: Set the attributes accordingly and try again.
• BM 42 87: Dynamic modification failed: application <appname> is not controlled by any controller, but has one of its attributes ControlledSwitch or ControlledShutdown set to 1
This message appears when the user wants the application <appname> to be controlled by a controller, but one or more of the application's attributes ControlledSwitch or ControlledShutdown is set to 1.
Action: Set the attributes accordingly and try again.
• BM 46 89: Dynamic modification failed: cannot modify a global attribute <attribute> locally on host <hostname>
The user cannot modify global attributes <attribute> like DetectorStartScript, NullDetector, or NonCritical locally on a host <hostname>.
Action: Modify the attribute globally, or modify a different attribute locally.
• BM 54: The RMS CF CIP mapping cannot be determined for any host due to the CIP configuration file <configfilename> missing entries. Please verify all entries in <configfilename> are correct and that CF and CIP are fully configured.
The CIP configuration file has missing entries.
Action: Make sure that the CIP configuration has entries for all the RMS hosts tha
267. nents
• PRIMECLUSTER family of products
• Solaris or Linux operating system
• Non-PRIMECLUSTER products such as volume managers and storage area networks

1.1 About this manual
This manual is structured as follows:
• The chapter "Introduction" on page 9 contains general information on RMS and introduces the PRIMECLUSTER family of products.
• The chapter "Using the Wizard Tools interface" on page 31 describes how to configure RMS using the RMS Wizards.
• The chapter "Configuration example" on page 57 illustrates the Wizard configuration process for two simple applications on a small cluster.
• The chapter "Administration" on page 91 discusses how to administer RMS by means of the Cluster Admin GUI.
• The chapter "Advanced RMS concepts" on page 147 provides details about state detection and transition processing in RMS.
• The chapter "Troubleshooting" on page 169 describes how to troubleshoot RMS using graphical user interface (GUI) and command line interface (CLI) tools.
• The chapter "Non-fatal error messages" on page 195 lists all RMS error messages written to the log file along with their causes and resolutions.
• The chapter "Fatal error messages" on page 281 lists all fatal RMS error messages written to the log file along with their causes and resolutions.
• The chapter "Console error messages" on page 295 lists all RMS error
268. ng a configuration 40
Using the wizard menus 41
Main configuration menu 42
Main configuration menu when RMS is not active 42
Main configuration menu when RMS is running 46
Secondary Menus 46
Basic and non-basic settings 47
Activating a configuration 49
Configuration elements 53
Scripts 53
Detectors 54
RMS objects 54
Further reading 55
Configuration example 57
Stopping RMS 57
Creating a configuration 58
Adding hosts to the cluster 58
Creating an application 61
Entering Machines+Basics settings 64
Entering non-basic settings 68
Specifying a display 70
Adding AlternateIps to the cluster (Linux only) 73
Activating the configuration 77
Creating a second application 79
Setting up a controlling application 83
Specifying controlled applications 84
Activating the configuration a second time 88
Starting RMS 89
5.3.8 5.3.9 5.3.10
269. ng dynamic reconfiguration, RMS calculates the configuration checksum by using /usr/bin/sum. If this fails, this message is printed and RMS exits with exit code 52.
Action: Check if /usr/bin/sum is available.

• (BM, 51) The RMS-CF interface is inconsistent and will require operator intervention. The routine routine failed with errno errno errorreason
While setting up CF, if RMS encounters a problem in the routine routine, which can be either dlopen or dlsym, it exits with exit code 95 or 94, respectively. The errorreason gives the reason for the error.
Action: Contact field support.

• (BM, 58) Not enough memory. RMS cannot continue its operations and is shutting down
This is a generic message that is printed to the switchlog before RMS shuts down because it does not have enough memory to operate.
Action: Contact field support.

• (BM, 67) An error occurred while writing out the RMS configuration after dynamic modification. RMS is shutting down
Upon concluding dynamic modification, RMS dumps its current configuration into the file /var/tmp/config.us. If this cannot be done, RMS cannot recalculate the configuration's checksum and therefore shuts down.
Action: The previous message in the switchlog explains why RMS has not been able to write the configuration file. Please correct the host environment a
270. ng out the necessary information out to a count file. If this file cannot be opened for writing, the above message is printed.
Action: Contact field support.

• FATAL ERROR: RMS has failed to start
Internal error.
Action: Contact field support.

• File open failed path errorreason
If the file path that is used by the hvassert utility to communicate with the RMS base monitor could not be opened, this message is printed along with the reason errorreason for the failure. hvassert then exits with exit code 5.
Action: Contact field support.

• Forced shut down on the local cluster host
When the detector restarts the base monitor, it prints this message before proceeding.
Action: None required.

• Fork failed
If RMS is unable to fork a process, it prints this message and exits with exit code 1.
Action: Contact field support.

• hvsend dest_object is not specified
If hvsend has been provided an unknown option in the input file, this message is printed and hvsend exits with exit code 9.
Action: Make sure that you specify a valid option.

• hvutil Could not determine if RMS is running on <targethost> errno exitcode
Printed when hvutil -A targethost is called, indicating that the command failed to ascertain whether or not RMS is running on targethost. The exitcode indicates a value in /usr/include/sys
271. ngth
Action: Ensure that the length of the SysNode name is less than maxlength.

9.4 CML: Command line

• (CML, 14) ERROR Unable to find or Invalid configuration file. CONFIGURATION MONITOR exits
The configuration file specified for RMS is non-existent. RMS exits with exit code 1.
Action: Specify a valid configuration file for RMS to function.

9.5 CMM: Communication

• (CMM, 1) Error establishing outbound network communication
If there is an error in creating outbound network communication, this message is printed and RMS exits with exit code 12.
Action: System error. Contact field support.

• (CMM, 2) Error establishing inbound network communication
If there is an error in creating inbound network communication, this message is printed and RMS exits with exit code 12.
Action: System error. Contact field support.

• (CMM, 3) Create queue error NODE_SYS_Q
The NODE_SYS_Q is used by the RMS base monitor to communicate the list of SysNode objects to hvdet_node. If there is a problem creating this queue for some reason, RMS exits with exit code 12.
Action: Contact field support.

9.6 CRT: Contracts and contract jobs

• (CRT, 6) Fatal system error in RMS. RMS will shut down now. Please check the bmlog for SysNode information
A system error has occurred within RMS.
Action: Please contact field support
272. nly one scalable controller can control an application.
Action: Fix RMS configuration.

• (BM, 99) 97 Dynamic modification failed: controlled application <controlledapp> runs on host <hostname> but it is controlled by a scalable controller <scontroller> which belongs to an application <controllingapp> that does not run on that host
Hostname mismatch between controlled and controlling applications. The controlling application must run on all the hosts where the controlled applications are running.
Action: Fix RMS configuration.

• (BM, 101) 99 Dynamic modification failed: controlled application <controlledapp> runs on host <hostname> but it is controlled by a scalable controller <scontroller> which belongs to a controlling application <controllingapp> that does not allow for the controller to run on that host
Hostname mismatch between controlled and controlling applications. The controlling application must run on all the hosts where the controlled applications are running.
Action: Fix RMS configuration.

• (BM, 105) 100 Dynamic modification failed: Invalid kind of generic resource specified in DetectorStartScript <script> for object <object>
A wrong value is supplied for the flag -k in the detector startup script.
Action: Fix RMS configuration.

• (BM, 106) The rKind attribute of object <
273. nning at the same time
hvutil: Provides a general administration interface to RMS. It performs various resource administration tasks, such as dynamically setting logging levels, sending a resource Offline, clearing faulted resources or hung cluster nodes in the Wait state, and setting detector time periods.
Table 1: Available CLI commands

2.7 Object types
An object type represents a group of similar resources that are monitored by the same detector (for example, all disk drives). Using the RMS Wizards, you can create configuration files that contain objects of various types, each representing resources or groups of resources to be monitored by RMS. The supported types are as follows:
• SysNode
• userApplication
• gResource
• andOp
• orOp
• controller
Refer to the chapter "Appendix Object types" on page 323 for the supported types, their required attributes, and a description of each object.
This information is provided for reference only. These objects are created by the RMS Wizards during the Configuration-Activate phase of the configuration process. Refer to the chapter "Using the Wizard Tools interface" on page 31.

2.8 Attributes
An attribute is the part of an object definition that specifies how the base monitor acts and reacts for a particular resource during normal operation. An attribute can include a devic
274. no errno
RMS has been unable to execute a script <script> for the object <objectname>. The error number errno returned by the operating system provides a diagnosis of the failure. RMS exits with exit code 8.
Action: Consult the system manual pages or the appendix of this manual for the explanation of error number errno and see if the cause is evident. If not, contact field support.

• (SCR, 15) node_sys_q cannot be accessed
The queue node_sys_q is used by the detector hvdet_node to get the list of the SysNode objects from the RMS base monitor. If there is a problem with this queue, this message is printed and RMS exits with exit code 12.
Action: Contact field support.

• (SCR, 18) Message send failed to node_sys_q
The RMS base monitor uses the queue node_sys_q to send the list of SysNode objects to hvdet_node after hvmod (the initial one on startup, or the subsequent ones when hvmod has been invoked explicitly). If RMS is unable to send this information to hvdet_node, this message is printed and RMS exits with exit code 2.
Action: Contact field support.

• (SCR, 26) The sdtool notification script has failed with status status after dynamic modification
After dynamic modification, the Shutdown Facility is notified via sdtool about the changes in the current configuration. If sdtool exits abnormally, the base monitor must exit.
Action
275. ns to control any number of user applications
• Cluster Admin: The Cluster Admin GUI is the primary administrative tool for RMS.
RMS also provides integrated services for market-specific applications. See your sales representative for availability and details.

2.2 How RMS provides high availability
RMS provides high availability of a customer's application by controlling and monitoring the state of all resources in use by a given application. Resources include items such as network interfaces, local and remote file systems, and storage area networks. RMS also monitors the state of each host in the cluster.

2.2.1 Applications, resources, and objects
Within RMS, each resource used by an application is represented as an object, and each object is configured with the following:
• Detectors
• Scripts
• Dependent resources
RMS monitors each resource by using detectors, which are processes that report resource states to the RMS base monitor process. Resources are typically reported as online (enabled, available) or offline (disabled, unavailable), but a variety of other states is possible according to the type of resource.
Each resource type has an associated set of scripts. Some scripts are reactive: they define the actions that RMS should take in response to state changes. Other scripts are proactive: they define the actions that RMS should use to take
276. nsole error messages
• resource is not in state state
If hvassert on an object resource for a state state discovers that the resource is not in that state, this message is printed and hvassert exits with exit code 1.
Action: None required.

• timestamp NOTICE User has been warned of hvshut -f and has elected to proceed
When the user invokes hvshut -f and elects to proceed with the command, this message is printed to confirm that hvshut -f is being invoked.
Action: None required.

• <command> failed with exit code exitcode
When the hvlogclean utility is invoked without the -d option, it executes the command command. If this command could not be executed for some reason, it returns the exit code exitcode, and the utility then exits with exit code 6.
Action: Take action based on the exit code exitcode.

• Assertion condition failed
If hvassert fails while using the -f or -F options, this message is printed and hvassert exits with exit code 1.
Action: None required.

• BEWARE hvshut -f may break the consistency of the cluster. No further action may be executed by RMS until the cluster consistency is re-established. This re-establishment includes restart of RMS on the shut down host. Do you wish to proceed? (yes: shut down RMS, no: leave RMS running)
This is a message asking fo
277. ntact field support.

• (CUP, 5) object received unknown contract
The contract received by the node from the application is not recognizable. Critical internal error.
Action: Contact field support.

• (CUP, 7) appname is locally online but is also online on another host
The user application is already online on another host and is also online on the current host.
Action: A user application can only be online on one host. Make sure the application is offline on all but one of the hosts. If this is not the case, use hvutil -f to bring the userApplication to an Offline state on the superfluous hosts.

• (CUP, 8) object could not get an agreement about the current online host cluster may be in an inconsistent condition
The cluster hosts are unable to reach an agreement as to which host is responsible for a particular userApplication. The most likely reason for this is that, due to an erroneous system administrator intervention (e.g. a forced hvswitch request), the userApplication is Online on more than one host simultaneously.
Note: This message corresponds to (CUP, 2). While (CUP, 8) is printed on the contract originator, (CUP, 2) is printed on the non-originator hosts.
Action: Analyze the cluster inconsistency and perform the appropriate action to resolve it. If the application is online on more than one host, shut down (hvutil -f) the userApplic
278. o hvdisp o
• command message queue is not ready yet
The command command relies on a message queue to transmit messages to the RMS base monitor. If this message queue is not available for some reason, this message is printed and the utility exits with exit code 3.
Action: Contact field support.

• command Must be super user to issue this command
This message indicates that the user must have root privileges in order to run the command command.
Action: Make sure that the user has root privileges before issuing the command.

• command RMS is not running
When the command command has been invoked, it checks that RMS is running. If it is not, this message is printed and the utility exits with exit code 2.
Action: Make sure that RMS is running before invoking the different utilities.

• directory cannot put message in queue
The various RMS commands, like hvdisp, hvswitch, hvutil, and hvdump, utilize lock files from the directory directory for signal handling purposes. These files are deleted after these commands are completed. The locks directory is also cleaned when RMS starts up. If they are not cleaned for some reason, this message is printed and RMS exits with exit code 99.
Action: Make sure that the locks directory directory exists.
279. obal Settings
[Figure 18: Main configuration menu. Legible entries: Application-Remove, Application-Clone, Configuration-Generate, Configuration-Activate, Configuration-Copy, Configuration-ConsistencyReport, Configuration-ScriptExecution, RMS-CreateMachine, RMS-RemoveMachine. Prompt: "Choose an action"]

4.3 Adding hosts to the cluster
Before you configure an application, you must define the cluster so that it includes all hosts on which the application may run. The names of all possible RMS hosts should have already been added to the /etc/hosts file (see the section "Site preparation" on page 34).
To override a default RMS primary host name, edit that host's hvenv.local file and set the RELIANT_HOSTNAME variable to the desired name. The contents of that host's RELIANT_HOSTNAME variable must match the corresponding /etc/hosts entry on every host in the cluster. This must be done before you add the host to the cluster in this step.
Select the nodes to be included in the configuration. The worksheet in Table 5 will be used as an aid to complete this configuration in an orderly fashion. See "Appendix Cluster planning worksheet" in the Installation Guide.

Cluster Name: FUJI
             Cluster Console   RCA   Node 1   Node 2
Node Name    N/A               N/A   fuji2    fuji3
Public LAN Na
280. ocess
• Turns on full RMS and detector logging with the -l0 (lowercase L, zero) option
• Gathers system and user times for the process
The truss(1)/strace(1) invocation and logging levels will be terminated after the number of seconds specified in the ScriptTimeout attribute. All information is stored in the switchlog file.

8 Non-fatal error messages
This chapter contains a detailed list of all non-fatal RMS error messages that appear in the switchlog. Most messages are accompanied by a description of the probable cause(s) and a suggested action to correct the problem. In some cases, the description or action is self-evident and no further information is necessary.
Some messages in the listings that follow contain words printed in italics. These words are placeholders for values, names, or strings that will be inserted in the actual message when the error occurs.

RMS error code description
A prefix in each message contains an error code and message number identifying the RMS component that detected the problem. You may need to provide this prefix to support engineers who are diagnosing your problem. The following list summarizes the possible error codes and the associated component:
ADC: Admin configuration
ADM: Admin, command, and detector queues
BAS: Startup and configuration errors
BM: Base monitor
CML: Command line
CRT: Contracts and contract jobs
CTL: Controllers
CUP: userAp
281. on "RMS Wizards detector logging" on page 189.
Unlike RMS, which logs most of its messages in the switchlog file, the RMS Wizards log everything at an application level. All messages associated with a particular configured application are logged in the file <RELIANT_LOG_PATH>/<application_name>.log. The file is created when either offline or online processing for the application begins.
Each RMS Wizard process that is run generates the following two types of log messages:
• User
• Debug
The log messages are contained in the following files:
• switchlog: Records RMS events relevant to the user, such as switch requests and fault indications. The RMS Wizards record resource state transitions in the switchlog file.
• <application_name>.log: The application-specific log file records all messages associated with that application. The output from all scripts run by the application goes into this log file.
• hvdet_xxx.gnnlog: These are detector log files, which record all relevant information regarding the resources they are monitoring, such as all state transitions.
The format of most RMS Wizard messages is as follows:
resource_name state timestamp message_type Message delimiter
There is a colon between each field of the message. The resource_name field is the name of the particular resource node in the RMS graph whose script is running
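As a rough illustration of the per-application log file naming described above, the following Python sketch builds the path <RELIANT_LOG_PATH>/<application_name>.log. The function name wizard_log_path and its log_dir parameter are assumptions made for this example only; they are not part of RMS, and the sketch does not guess a default directory when RELIANT_LOG_PATH is unset.

```python
import os
from pathlib import PurePosixPath


def wizard_log_path(application_name, log_dir=None):
    """Return <RELIANT_LOG_PATH>/<application_name>.log as a POSIX path.

    log_dir overrides the RELIANT_LOG_PATH environment variable. It is a
    hypothetical convenience parameter, added so the sketch can be tried
    on a machine where RMS is not installed.
    """
    if log_dir is None:
        log_dir = os.environ.get("RELIANT_LOG_PATH")
    if not log_dir:
        # No default is assumed here; the caller must supply the directory.
        raise ValueError("RELIANT_LOG_PATH is not set and no log_dir was given")
    return PurePosixPath(log_dir) / f"{application_name}.log"
```

For example, wizard_log_path("app1", "/tmp/rmslog") yields the path /tmp/rmslog/app1.log, where both the application name and the directory are placeholders.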
282. on tools used to create and manage applications in an RMS configuration.
See also RMS Wizard Kit, Reliant Monitor Services (RMS).

SAN
See Storage Area Network.

Scalable Internet Services (SIS)
Scalable Internet Services is a TCP connection load balancer, and dynamically balances network access loads across cluster nodes while maintaining normal client/server sessions for each connection.

scalability
The ability of a computing system to dynamically handle any increase in work load. Scalability is especially important for Internet-based applications where growth caused by Internet usage presents a scalable challenge.

SCON
See single console.

script (RMS)
A shell program executed by the base monitor in response to a state transition in a resource. The script may cause the state of a resource to change.

service node (SIS)
Service nodes provide one or more TCP services (such as FTP, Telnet, and HTTP) and receive client requests forwarded by the gateway nodes.
See also database node (SIS), gateway node (SIS), Scalable Internet Services (SIS).

shared resource
A resource, such as a disk drive, that is accessible to more than one node.
See also private resource (RMS), resource (RMS).

simple virtual disk (RCVM)
Simple virtual disks define either an area within a physical disk partition or an entire partition.
See also concatenated virtual disk (RCVM), striped virtual disk (RCVM)
283. oncepts Online processing

Manual request using the GUI
To manually generate an online request, perform the following steps:
1. Using the graph, left-click on an application; a pop-up menu is displayed.
2. Right-click on the switch or online selections within the pop-up menu.

Manual request using the CLI
To generate an online request for each userApplication, use the hvswitch command. Refer to the hvswitch manual page for details on usage and options.

6.4.1.2 Automatic methods
Both automatic methods can only invoke a priority switch.

Automatic request at RMS startup
When RMS first starts on a cluster, it switches the userApplication online on the highest priority host. Automatic switch at RMS startup only occurs under the following conditions:
• All SysNodes associated with a specific application are online
• userApplication is not online on any other cluster node
• AutoStartUp attribute of the userApplication is enabled
These limitations ensure that the userApplication is not started on more than one cluster node at a time.

Automatic request when a fault occurs
RMS initiates a priority switchover when it detects either a fault of a userApplication or a fault of a SysNode on which a userApplication was online. This automatic switchover occurs only if the AutoSwitchOver attribute of the userApplication is enabled.

6.4.2 Online processing in a logical graph of a userApplication
Relative to the resource graph
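The three conditions for an automatic switch at RMS startup, listed earlier in this section, can be read as a single predicate. The Python sketch below is only a model for reasoning about them: the AppView class, its field names, and the node names used in the usage note are hypothetical, and nothing here queries a real cluster or reflects how RMS implements the check internally.

```python
from dataclasses import dataclass, field


@dataclass
class AppView:
    """Hypothetical snapshot of one userApplication, for illustration only."""
    auto_startup: bool                                   # AutoStartUp enabled?
    sysnode_online: dict = field(default_factory=dict)   # SysNode name -> online?
    online_on: set = field(default_factory=set)          # nodes where app is Online


def may_auto_start(app):
    """Return True only if all three startup conditions from the text hold."""
    # Condition 1: every SysNode associated with the application is online.
    all_sysnodes_online = bool(app.sysnode_online) and all(app.sysnode_online.values())
    # Condition 2: the application is not already online on any cluster node.
    not_online_elsewhere = not app.online_on
    # Condition 3: the AutoStartUp attribute is enabled.
    return all_sysnodes_online and not_online_elsewhere and app.auto_startup
```

With AppView(True, {"fuji2RMS": True, "fuji3RMS": True}) the predicate holds. It stops holding as soon as one SysNode is reported offline, the application is already online somewhere, or AutoStartUp is disabled, which mirrors the stated goal of never starting the application on more than one node at a time.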
284. ondent userApplication on a remote host is in the DeAct state, but the local userApplication is not. Critical internal error.
Action: Contact field support.

• (UAP, 28) object failed to update the priority list. Cluster may be in an inconsistent state
When the local host receives a contract for unlocking the hosts in the cluster with respect to a particular operation and finds that a particular host has died, it updates its priority list to reflect this. If it is unable to perform this operation for some reason, this message is printed. This indicates a critical internal problem in memory management.
Action: Contact field support.

• (UAP, 29) object contract data section is corrupted
This message appears when the application is unable to read the data section of the contract.
Action: Contact field support.

• (UAP, 32) object received unknown contract
This message appears when the application is unable to unlock the contract because it could not find the expected kind of contract request in its code. Critical internal error.
Action: Contact field support.

• (UAP, 33) object unknown task in list of outstanding contracts
This message appears when a userApplication object finds a task in the list of outstanding contracts but is unable to process it because it could not find the kind of contract request in its
285. oo Pe anh ee Pare a ayo nad eh Ree aa Ba i 349 15 2 CE ett pao gs ae ale en ee 2 Bd eae A 349 15 3 GRS tte eee OS id Oe a BE Sed 350 15 4 CIP iaa a aod ak Ge Qed Bo AS he ee eS 350 15 5 Monitoring Agent o 351 15 6 PAS A A a Se ye 351 15 7 PES a daa i ts a ee BSS 351 15 8 ROM Maira ra we SMe A e a eae a a 352 15 9 Resource Database o e 352 15 10 AMS a a to od ah alan dies de to Hoe de 353 15 11 RMS Wizards a i soena o e e 354 15 12 SOON ara as als Goa a ae goa ed Jed A 355 15 13 E ld a A a a 355 15 14 SIS sita a aa She cade kad 356 15 15 Web Based Admin View 0 a 356 Glossary chia a a a Pe bee 8 357 Abbreviati0NS o 375 FIQUTOS o eraud r ena E a a eG 379 Tables i iria aaa ee ot ALS ew ee Pe Be ii 385 de soo as eh a et a hk ts ee eo ee Boe 387 U42117 J Z100 4 76 1 Preface PRIMECLUSTER Reliant Monitor Services RMS is a software monitor designed to guarantee the high availability of applications in a cluster of nodes After an introduction to RMS terminology and principles of operation this manual describes how to configure RMS using the RMS Wizards and how to administer RMS using the Cluster Admin GUI The manual is aimed at system administrators and programmers familiar with installing and maintaining RMS configurations Those who configure and administer RMS should be familiar with the following system functions and compo
286. or valid string of the form timeout_value[:offline_value:online_value]
Default: 300
Valid for all object types. Specifies the timeout value for all scripts associated with that object in the configuration file. Use the string format to specify individual timeout values of offline_value for OfflineScript and online_value for OnlineScript.

• ShutdownPriority
Possible Values: 0-MAXINT
Default: 0
Valid for userApplication objects. ShutdownPriority assigns a weight factor to the application that is used by the Shutdown Facility. When interconnect failures and the resulting concurrent node elimination requests occur, SF calculates the shutdown priority of each subcluster as the sum of the subcluster's SF node weights plus the RMS ShutdownPriority of all online application objects in the subcluster. The optimal subcluster is defined as the fully connected subcluster with the highest weight.

• StandbyCapable
Possible Values: 0, 1
Default: 0
Valid for resource objects. If set to 1, the object performs standby processing on all nodes where the parent application is supposed to be Offline.
The user can modify this attribute for a cmdline subapplication only. The configuration tools control this attribute for all other subapplications.

• StandbyTimeout
Possible Values: 0-MAXINT (in seconds)
Default: 0
Valid for controller objec
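The ShutdownPriority description earlier in this section amounts to a small calculation: a subcluster's weight is the sum of its SF node weights plus the ShutdownPriority of every application online in it, and the heaviest fully connected subcluster is the optimal one. The Python sketch below works through that arithmetic. The function names, the subcluster labels, and all numeric values are invented for illustration; connectivity is assumed rather than checked, and tie-breaking, which the text does not describe, simply follows Python's max().

```python
def subcluster_weight(node_weights, online_app_priorities):
    """Sum of SF node weights plus ShutdownPriority of online applications."""
    return sum(node_weights) + sum(online_app_priorities)


def optimal_subcluster(subclusters):
    """Pick the name of the heaviest subcluster.

    subclusters maps a hypothetical subcluster name to a pair
    (node_weights, online_app_priorities). Every entry is assumed to be
    fully connected already; connectivity checks are out of scope here.
    """
    return max(subclusters, key=lambda name: subcluster_weight(*subclusters[name]))


# Hypothetical split of a three-node cluster after an interconnect failure.
split = {
    "A": ([1, 1], [20]),    # two nodes of weight 1, one app with priority 20
    "B": ([1], [10, 5]),    # one node of weight 1, two apps (10 and 5)
}
weights = {name: subcluster_weight(*parts) for name, parts in split.items()}
```

In this made-up split, subcluster A weighs 1 + 1 + 20 = 22 and subcluster B weighs 1 + 10 + 5 = 16, so optimal_subcluster(split) selects "A". Raising an application's ShutdownPriority is therefore one way to bias which side of a split survives.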
287. oring 54 non basic settings 49 object types 26 scripts 21 shared remote entries 37 states 10 rKind 339 RMS graph 44 naming conventions 35 59 RMS CLI 23 hvassert 24 hvattr 24 hvcm 24 hvconfig 24 hvdisp 24 hvdist 24 hvdump 24 hvgdmake 24 hvlogclean 25 hvrclev 25 hvreset 25 hvshut 25 hvswitch 25 hvthrottle 25 hvutil 25 switching userApplication 23 Cmdline 33 RMS commands Controller 33 hvassert 353 Fsystem 33 hvem 24 129 353 394 U42117 J Z100 4 76 Index hvconfig 24 353 hvdisp 24 145 353 hvdist 24 353 hvdump 24 354 hvenv local 354 hvexec 53 hvgdmake 24 354 hvlogclean 25 354 hvrclev 25 hvreset 25 354 hvshut 25 354 hvshut command 133 hvswitch 25 135 354 hvthrottle 25 354 hvutil 25 138 354 RMS Wizard Kit 17 19 detectors 19 hvw command 19 overview 10 scripts 19 RMS Wizard Tools 17 detectors 19 hvw command 19 overview 10 resource types 19 scripts 19 RMS Wizards See wizards 61 RMS See Reliant Monitor Services rName attribute 339 root privileges 93 running processes 16 S SA_blade cfg 355 SA_rccu cfg 355 SA_rps cfg 355 SA_rsb cfg 355 SA_scon cfg 355 SA_sspint cfg 355 SA_sunF cfg 355 SA_wtinps cfg 355 scalability 9 scalable controller state change script 23 Scalable mode controllers 14 Scalable attribute 331 SCON reply time 165 scon 355 script time out 165 scripts 11 22 allocating 150 Offline 19 Online 19 resources 21 RMS Wizard Kit 19 timeout 348 S
288. ost attempts to prepare its own configuration for a subsequent transfer to the remote host. For that, it uses the command <command>. If the <command> fails, the hvjoin operation is aborted.
Action: Contact field support.

• (ADC, 59) Failed to store remote configuration files on this host. Command used <command>
When this host joins a cluster, it attempts to store remote configuration files for a subsequent dynamic modification on this host. For that, it uses the command <command>. If the <command> fails, the hvjoin operation is aborted.
Action: Contact field support.

• (ADC, 60) Failed to compress file <file>. Command used <command>
File transfer is a part of some RMS operations, such as dynamic modification and hvjoin. Before transferring a file <file> to a remote host, it must be compressed with the command <command>. If the <command> fails, the operation that requires the file transfer is aborted.
Action: Contact field support.

• (ADC, 61) Failed to shut down RMS on host <host>
While performing RMS clusterwide shutdown, RMS on host <host> failed to shut down.
Action: Contact field support.

• (ADC, 62) Failed to shut down RMS on this host attempting to exit RMS
While performing RMS clusterwide shutdown, RMS on this host failed to shut down. Another attempt to shut down this host is automatically initiated.
289. ot shut down
This message could appear if hvshut -a was invoked and not all of the nodes replied with an acknowledgement.
Action: Log in to the remote hosts. If RMS is still running, perform hvutil -f <appname> to shut down each application one at a time. If this fails, refer to the switchlog and <appname>.log files to find the reason for the problem. If all applications have been shut down correctly, perform a forced RMS shutdown with hvshut -f. Report the problem to RMS support.

• (ADM, 70) NOT ready to shut down
The node on which hvshut -a has been invoked is not yet ready to be shut down because the application is busy on the node.
Action: Wait until the ongoing action (e.g. switchover, dynamic reconfiguration) has terminated.

• (ADM, 75) 57 Dynamic modification failed: child <resource> of userApplication object <appname> has HostName attribute <hostname> common with other children of the same userApplication
This message occurs if the RMS internal sanity check functions detect a severe configuration problem. This message should not occur if the configuration has been set up using the RMS configuration wizards.
Action: Contact field support.

• (ADM, 76) Modification of attribute <attribute> is not allowed within existing object <object>
290. ources and applications are brought Online or Offline in the correct order
• Initiates and controls automatic application switchover as required by a CLI request or in case of a resource or node failure
• Performs various administrative functions

2.6.2 Detectors and states
Detectors are independent processes that monitor specific sets of resources in order to determine their state. The detector does not determine whether the current state of a resource is the correct state (for example, if a resource is Offline but is supposed to be Online); that is the role of the base monitor.
Detectors can report the following states to the base monitor:
Online: Enabled, ready for use. All required children are online, and no errors were encountered while scripts were processed.
Offline: Disabled, not ready for use. The scripts have successfully deconfigured the resource.
Faulted: Error condition encountered. The error may have occurred in the resource, in one of its children, or during script processing.
Standby: Ready to be quickly brought Online when needed.
Warning: Some warning threshold has been exceeded. Also reported when:
• a scalable controlled application is in transition from Online to Offline or from Standby to Faulted
• a scalable controller object is Online, but some of its controlled applications are not
• a controlling application is Online, but some of its scalable controller objects report Warning
OfflineFault
291. ove all invalid entries in the RMS default configuration file. Refer to the hvcm man page.
• RMS has failed to start: multiple entries in the RMS default configuration file configfilename
RMS cannot be started if there are multiple entries in the default configuration file config.us.
Action: Remove all the obscure entries from the RMS default configuration file so that it contains only one valid configuration.
• RMS has failed to start: RELIANT_HOSTNAME is not defined in the RMS environment
The environment variable RELIANT_HOSTNAME is not properly set.
Action: Ensure that the RMS environment variable RELIANT_HOSTNAME was not erroneously set to a null string or explicitly unset in hvenv.local.
• RMS has failed to start: the number of arguments specified at the command line overrides the internal buffer of the RMS start utility
This message appears when the number of arguments specified at the command line exceeds the buffer capacity (30 command line arguments).
Action: Refer to the hvcm manual page for the correct syntax and usage.
• RMS has failed to start: the number of arguments specified at the RMS default configuration file configfilename overrides the internal buffer of the RMS start utility
This message appears when the user tries to start the RMS using the RMS default configu
292. p between specific resources.
local area network
See public LAN.
local node
The node from which a command or process is initiated. See also remote node, node.
log file
The file that contains a record of significant system events or messages. The base monitor, wizards, and detectors can have their own log files.
MDS
See Meta Data Server.
message
A set of data transmitted from one software process to another process, device, or file.
message queue
A designated memory area which acts as a holding place for messages.
Meta Data Server
GFS daemon that centrally manages the control information of a file system (meta data).
mirrored disks (RCVM)
A set of disks that contain the same data. If one disk fails, the remaining disks of the set are still available, preventing an interruption in data availability. See also mirrored pieces (RCVM).
mirrored pieces (RCVM)
Physical pieces that together comprise a mirrored virtual disk. These pieces include mirrored disks and data disks. See also mirrored disks (RCVM).
mirror virtual disk (RCVM)
Mirror virtual disks consist of two or more physical devices, and all output operations are performed simultaneously on all of the devices. See also concatenated virtual disk (RCVM), simple virtual disk (RCVM), striped virtual disk (RCVM), virtual disk.
mount point
The point in the directory tree where a file system is attached.
multihost
293. pecifying the basic and non-basic settings for your application and achieving a consistent result, you have successfully finished the Application Create part of the configuration procedure.
> Select SAVE+EXIT by entering the number 3.
This will take you back to the RMS configuration menu.
4.8 Adding AlternateIps to the cluster (Linux only)
To maintain high availability, RMS can employ multiple physical network connections to each host in the cluster. For RMS purposes, one connection to each machine is associated with the primary host name. Redundant connections to the same machine are associated with alternate interfaces, known as AlternateIps. For high reliability operation, AlternateIps should be included in the configuration.
In our example, both fuji2 and fuji3 have a total of three connections to the network. See the /etc/hosts entries in the section "Adding hosts to the cluster" on page 58. The primary host names were specified when the cluster was defined. In this step, two AlternateIps will be added for each machine.
Configure your applications and all their associated nodes (Machines lists) before you add AlternateIps. If a node is not used by any application, neither its primary name nor its AlternateIps will be available in the menus described below.
> From the Main configuration menu, select 15 Configur
294. [Figure 112: Local environmental variables window]
CLI
Display the environment variables with the hvdisp command, which does not require root privilege:
hvdisp ENV
hvdisp ENVL
5.3.10 Displaying application states
The states of the various applications are indicated by different colors. The legend for the application states appears in the RMS main window below the RMS Tree panel (Figure 113).
[Figure 113: Cluster Admin screenshot showing the RMS tree and the attribute table for app2 on fuji2RMS; screenshot text omitted]
295. plication contracts
DET: Detectors
GEN: Generic detector
INI: init script
MIS: Miscellaneous
NOD: Node detector
QUE: Message queues
SCR: Scripts
SWT: Switch requests (hvswitch command)
SYS: SysNode objects
UAP: userApplication objects
US: us files
WLT: Wait list
WRP: Wrappers
8.1 ADC: Admin configuration
• ADC 1 Since this host <hostname> has been online for no more than time seconds, and due to the previous error, it will shut down now
time is the value of the environment variable HV_CHECKSUM_INTERVAL if set, or 120 seconds otherwise. This message could appear when the checksums of the configurations of the local and the remote host are different, no more than time seconds have elapsed, and one of the following is true:
- The remote host is joining the cluster and all the applications on the local host are either Offline or Faulted. RMS exits with exit code 60.
- The configuration for the local host does not include the remote host, but the configuration for the remote host does include the local host. The local host hostname will shut down with exit code 60.
Action: The local and the remote hosts are running different configurations. Make sure that both of them are running the same configuration.
• ADC 2 Since not all of the applications are offline or faulted on this host <hostname>
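The time window described for ADC 1 can be illustrated with a short Python sketch. It is only a sketch of the rule stated above (HV_CHECKSUM_INTERVAL if set, 120 seconds otherwise), not RMS code: the function names are hypothetical, and treating an empty value the same as an unset one is an assumption.

```python
DEFAULT_CHECKSUM_INTERVAL = 120  # seconds, the fallback stated in the ADC 1 description


def checksum_interval(env):
    """Return the interval in seconds: HV_CHECKSUM_INTERVAL if set, else 120.

    `env` is any mapping of variable names to strings (for example os.environ).
    Assumption: an empty string is treated like an unset variable.
    """
    raw = env.get("HV_CHECKSUM_INTERVAL")
    if raw is None or raw.strip() == "":
        return DEFAULT_CHECKSUM_INTERVAL
    return int(raw)


def within_checksum_window(seconds_online, env):
    """True while the host has been online for no more than the interval."""
    return seconds_online <= checksum_interval(env)
```

For example, `within_checksum_window(90, {})` falls back to the 120 second default, while passing `{"HV_CHECKSUM_INTERVAL": "60"}` shortens the window.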
296. plication is important, the Halt attribute can be set in the userApplication during the configuration procedure. This attribute ensures that the local host is shut down immediately if RMS cannot resolve a double fault state. The other hosts detect this as a system failure, and RMS transfers the applications running on the failed host to another host.
PreserveState without AutoSwitchOver
If the PreserveState attribute is set and the AutoSwitchOver attribute is not set in the userApplication, the process is as follows:
1. The userApplication does not initiate any further activity after the fault script executes.
2. All nodes remain in their current state.
Use this attribute if an application can remedy faults in required resources.
Neither AutoSwitchOver nor PreserveState
If neither the AutoSwitchOver attribute nor the PreserveState attribute is set, RMS carries out offline processing as a result of the fault, but it does not initiate a switchover after offline processing is complete (successful or not).
Both AutoSwitchOver and PreserveState
If both the AutoSwitchOver attribute and the PreserveState attribute are set, RMS ignores the PreserveState attribute and responds as if only the AutoSwitchOver attribute were set.
Directed switch fault
A special case occurs when a directed switch request causes a fault during offline processing. In this case, RMS carries out a switchover after completing the offline processing that the fa
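The attribute combinations above amount to a small decision table. The following Python sketch restates it under stated assumptions: the enum and function names are hypothetical, and the behavior for AutoSwitchOver alone is not described in this excerpt, so it is represented only by a label rather than spelled out.

```python
from enum import Enum


class FaultResponse(Enum):
    # Hypothetical labels for the three outcomes described in the text.
    NO_FURTHER_ACTIVITY = "no further activity after the fault script executes"
    OFFLINE_ONLY = "offline processing, no switchover afterwards"
    AS_AUTOSWITCHOVER = "respond as if only AutoSwitchOver were set"


def fault_response(auto_switch_over: bool, preserve_state: bool) -> FaultResponse:
    """Map the two attribute settings to the response described above."""
    if auto_switch_over:
        # PreserveState is ignored whenever AutoSwitchOver is also set.
        return FaultResponse.AS_AUTOSWITCHOVER
    if preserve_state:
        # PreserveState without AutoSwitchOver: all nodes stay as they are.
        return FaultResponse.NO_FURTHER_ACTIVITY
    # Neither attribute set: offline processing only.
    return FaultResponse.OFFLINE_ONLY
```

The directed switch fault is a separate special case and is deliberately left out of this table.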
297. pplication acts as a resource that must be online if the teller application is to function properly.
[Figure 2: Controlled application scenario. Labels: controlling application (teller application), controlled application (database application), Ipaddress resource, Fsystem resource]
RMS accommodates parent-child relationships between applications by providing a Controller object, which is often simply called a controller. Like resource objects, a controller is configured with detectors and scripts: the detectors monitor the state of the child (controlled) application, and the scripts implement appropriate responses by the parent (controlling) application.
Figure 3 demonstrates how RMS would represent the banking scenario. For the purposes of this example, only the application and controller objects are included in the illustration; resource objects representing network interfaces or file systems are not shown. Note that each controlled application requires a separate controller in the parent application, and that controllers exist only for internal RMS management purposes: there is no equivalent within the context of the user's applications.
[Figure 3: RMS representation of controlled application. Labels: SysNode object (cluster node), userApplication object (controlling application, teller application), Controller object, database application] user
298. pplication below it. This gives a composite view of all the subapplications that the first application depended on, directly or indirectly (Figure 84). If the controlled application has further controller objects, then the process is recursively repeated.
[Figure 84: Composite subapplication graph]
5.2.5.5 Configuration information from a graph
Click the left mouse button on the object of interest to see the configuration information of the object in a graph form. A pop-up screen displays the attributes (Figure 85).
[Figure 85: Configuration information pop-up]
5.2.5.6 Command pop-ups
You can use the context-sensitive command pop-up menus on the RMS graph nodes to perform many operations. Invoke the pop-up menu by right-clicking on an object. The menu options are based on the type and the current state of the selected object (Figure 86).
299. printed.
Action: Provide a valid value for the delay.
• It may take few seconds to do Debug Information collection
As the hvdump utility dumps out the information regarding the resource graph, it prints this message while it is collecting the information.
Action: None required.
• localfile filename does not exist or is not an ordinary file
If the localfile specified as an argument to hvrcp does not exist, or if it is not a regular file, hvrcp exits with exit code 7.
Action: Make sure that the localfile exists and is an ordinary file.
• Modification file name is missing on the command line usage hvmod i 1 f config_file us E L i 1 c modification directives
When the hvmod utility is invoked with an option that does not conform to its expected usage, this message is the result and the utility exits with exit code 2.
Action: Follow the expected usage for the utility.
• Name of the modification file is too long
If the name of the modification file specified as an argument through the -f option, or the modification directives specified via the -c option, exceed 113 in length, this message is printed and hvmod exits with exit code 4.
Action: Make sure that the arguments specified via the -f and -c options are not too long.
• NOTICE: User has been warned of hvshut -f -a and has elected to proceed
Wh
300. professional services.
• WRP 25 Child process <cmd> with pid <pid> has exceeded its timeout period. Will attempt to kill the child process.
The child process cmd has exceeded its timeout period.
Action: Please contact professional services.
• WRP 28 RMS monitor has encountered an irregular sequence of timer interrupts (off by <offset> seconds). This may have been caused by a manual OS time change, or by an unusually high OS performance load, or by some other OS condition. If this error appears frequently, then normal RMS operations can no longer be guaranteed; it can also lead to a loss of heartbeats with remote hosts and to an elimination of the current host from the cluster.
The RMS base monitor keeps track of the regularity of its timer interrupts, which are supposed to occur every second. If the interrupts become irregular due to a high load, a manual time change, or any other reason, the above notice is printed. If the discrepancy value becomes too high, or if this error appears frequently, this might lead to a malfunction of the RMS base monitor, which can cause a loss of high availability.
Action: Do not attempt to change the system date or time by any significant value while RMS is running. Raise the priority of the RMS base monitor to ensure that it has enough CPU time to perform its operations during a high load.
• WRP 29 RMS on th
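The WRP 28 description says that timer interrupts are expected every second and that the message reports how far off they are. The Python sketch below shows one way to compute such offsets from a list of tick timestamps. It is an illustration only: the function names and the `tolerance` parameter are hypothetical, and the threshold RMS actually applies is not stated in the text.

```python
def timer_offsets(timestamps, expected_interval=1.0):
    """Return, for each pair of consecutive ticks, how far the gap
    deviates from the expected interval (in seconds)."""
    return [(later - earlier) - expected_interval
            for earlier, later in zip(timestamps, timestamps[1:])]


def irregular_offsets(timestamps, tolerance, expected_interval=1.0):
    """Return only the offsets whose magnitude exceeds `tolerance`.

    `tolerance` is a hypothetical parameter chosen by the caller.
    """
    return [offset
            for offset in timer_offsets(timestamps, expected_interval)
            if abs(offset) > tolerance]
```

A clock that jumps forward (for example after a manual time change) produces a positive offset, and one that is set backwards produces a negative offset.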
301. r. The detector then exits with exit code 130.
Action: Contact field support.
• NOD 22 The interconnect interconnect to the cluster host host failed
Action: Critical error. Contact field support.
• NOD 25 The network connection to the cluster host host failed
Action: Critical error. Contact field support.
• NOD 26 detector detector can't report resource state
If the detector <detector> cannot report the state of the other SysNodes in the cluster to the base monitor running on the same host as the detector, this message is the result. This is most likely a problem with the queue when the detector is reporting the state.
Action: Contact field support.
• NOD 28 detector SysNode list empty in hvdet_node
hvdet_node contacts the base monitor to get the list of SysNodes. If it gets an empty list back in return, this message is the result. RMS then exits with exit code 129.
Action: Contact field support.
• NOD 29 The RMS CF interface is inconsistent and will require operator intervention. The routine routine failed with error code errorcode errorreason
This is a generic message indicating that the execution of the routine <routine> failed due to the reason <errorreason>, and hence the RMS CF interface is inconsistent. Depending on which routine <routine> has failed, the detector hvde
302. r manual action would be required. To have the switchover procedure carried out automatically, you have to select 16 AutoSwitchOver in this menu and then specify the desired mode(s) from the menu that follows (Figure 27).
Set flags for AutoSwitchOver: Currently set: NO (N)
1) HELP
2) -
3) SAVE+RETURN
4) DEFAULT
5) NO (N)
6) HOSTFAILURE (H)
7) RESOURCEFAILURE (R)
8) SHUTDOWN (S)
Choose one of the flags: 6
Figure 27: AutoSwitchOver mode
> Set a flag by entering the number 6 for HOSTFAILURE. This means that RMS switches an application to another node automatically in the case of a node failure.
Set flags for AutoSwitchOver: Currently set: HOSTFAILURE (H)
1) HELP
2) -
3) SAVE+RETURN
4) DEFAULT
5) NO (N)
6) NOT HOSTFAILURE (H)
7) RESOURCEFAILURE (R)
8) SHUTDOWN (S)
Choose one of the flags: 7
Figure 28: Setting flags for AutoSwitchOver mode
> Enter the number 7 for RESOURCEFAILURE (see Figure 28). This means that RMS switches an application to another node automatically in the case of a resource failure.
> Enter the number 3 for SAVE+RETURN (see Figure 28).
You will be returned to the Machines+Basics menu (Figure 29). Note that item 16 now displays the AutoSwitchOver flags you just set.
Consistency check Machines Basics appl consistent 1 HELP a 3 SAVE EX
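The flag menu above behaves like a set of independent switches: after HOSTFAILURE is selected, item 6 changes to NOT HOSTFAILURE. The Python sketch below models that with `enum.Flag`. It is an illustration of the idea, not the wizard's implementation; the class and function names are hypothetical, and the assumption that choosing an already set flag clears it again is inferred from the NOT HOSTFAILURE menu entry.

```python
from enum import Flag, auto


class AutoSwitchOver(Flag):
    # NO means that no automatic switchover mode is selected.
    NO = 0
    HOSTFAILURE = auto()
    RESOURCEFAILURE = auto()
    SHUTDOWN = auto()


def toggle(current: AutoSwitchOver, flag: AutoSwitchOver) -> AutoSwitchOver:
    """Set the flag if it is clear, clear it if it is set (assumed menu behavior)."""
    return current ^ flag


def flag_names(current: AutoSwitchOver) -> list:
    """List the names of the individual flags that are currently set."""
    modes = (AutoSwitchOver.HOSTFAILURE,
             AutoSwitchOver.RESOURCEFAILURE,
             AutoSwitchOver.SHUTDOWN)
    return [mode.name for mode in modes if mode in current]


# Mirror the two menu steps from the example: enter 6, then enter 7.
state = AutoSwitchOver.NO
state = toggle(state, AutoSwitchOver.HOSTFAILURE)
state = toggle(state, AutoSwitchOver.RESOURCEFAILURE)
```

After these two steps, `flag_names(state)` lists HOSTFAILURE and RESOURCEFAILURE, matching the flags set in the walkthrough.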
303. [Figure 66: Top menu]
Open Cluster Admin as follows:
1. Click on Global Cluster Services.
2. Click on the Cluster Admin button to start Cluster Admin.
3. The Choose a node for initial connection screen appears (Figure 67). Select a node and click on OK. The main Cluster Admin screen appears.
[Figure 67: Cluster menu]
5.2.3 Main screen
The main screen (Figure 68) contains the following tabs on the left-hand side panel:
• cf
• rms & pcs
• sis
• msg (message window)
[Figure 68: Main screen]
Select the appropriate tab to switch to
304. r confirmation from the user if he wants to proceed with hvshut -f. If the user elects to proceed, yes would be the appropriate answer; answer no if there is no intention of going ahead with it.
Action: Respond to the prompt.
• BEWARE: the hvreset command will result in a reinitialization of the graph of the specified userApplication. This affects basically the RMS state engine only. The re-initialization does not mean that activities invoked by RMS so far will be made undone. Manual cleanup of halfway configured resources may be necessary. Do you wish to proceed? (yes = reset application graph, no = abort hvreset)
Action: Respond to the prompt.
• Can't open modification file
When hvmod is invoked with the -c option, it utilizes a temporary file. If this file cannot be opened for writing, this message is the result and hvmod exits with exit code 1.
Action: Contact field support.
• Cannot start RMS: BM is currently running
RMS is already running on the local host.
Action: Shut down the currently running version of RMS and restart.
• Change dest_object to node
Action: None specified.
• Command aborted
A command has prompted for reconfirmation. If the user answers no to the question of whether he wants to proceed with the command, this message is printed and the command is aborted.
Action: None require
305. r script is not good or the format is not good.
Action: Check the detector script.
• SCR 20 The attempt to shut down the cluster host host has failed: errorreason
The cluster host could not be killed because of one of the following reasons:
- Script exited with a non-zero status
- Script exited due to signal caught
- Other unknown failure
Action: Verify the status of the node, make any necessary corrections to the script, potentially correct the node state manually if possible, and issue the appropriate hvutil -o or -u as needed.
• SCR 21 Failed to execute the script <script>: errno <errno>, error reason <errorreason>
If the script cannot be executed, this message is printed out along with the errorreason.
Action: Take action based on the errorreason.
8.16 SWT: Switch requests (hvswitch command)
• SWT 4 object is online locally, but is also online on onlinenode
If the object object is online on more than one host, this message is the result.
Action: Make sure that the object object is online on only one host in the cluster.
• SWT 20 Could not remove host <hostname> from local priority list
A host has left the cluster, but RMS was unable to remove the corresponding entry from its internal Priority List. This is an internal problem in the program stack and memory management.
Action
306. ransmit this contract a certain number of times, which is determined internally. If this message transmission fails even after all these attempts, this message is printed to the switchlog and this contract is discarded (a UAP contract is not discarded).
Action: Make sure that there is no problem with the cluster interconnect. If contract retransmissions occur for userApplication contracts, make sure that the cluster is in a consistent condition, i.e., no userApplication is online on more than one host, no SysNode is in a pending Wait state, etc.
• CRT 5 The contract <crtname> is being dropped because the local host <crthost> has found the host originator <otherhost> in state <state>. That host is expected to be in state Online. Please check the interhost communication channels and make sure that these hosts see each other Online.
The local host crthost sees the contract host originator in state state when it is expected to be in state Online.
Action: Make sure that the interhost communication channels are working correctly and that the hosts see each other online.
8.7 CTL: Controllers
• CTL 1 Controller <controller> will not operate properly since its controlled resource <resource> is not in the configuration
This message appears when a resource is not in the configuration that is controlled by a controller and the controll
307. ration file but is unable to do so because the number of arguments specified in the RMS default configuration file overrides the internal buffer of the RMS start utility.
Action: Remove some of the unwanted arguments from the RMS default configuration file. Check the man page for hvcm to get the required options to start RMS.
• RMS has failed to start: the options -a and -s are incompatible and may not be specified both
This message appears when the user tries to start RMS using the options -a and -s simultaneously.
Action: Check the man page for hvcm to get the format.
• rms is dead
The hvrcp utility checks every 10 seconds whether the RMS base monitor is alive. If it finds that it is not alive, it prints this message and exits with exit code 1.
Action: Get RMS running on the host.
• RMS on node node could not be shutdown with hvshut -A
Action: None specified.
• Root access required to start hvcm
To start RMS, the user must have root access.
Action: Log in as root and try hvcm.
• Sending data to resource
This message is printed when logging is turned on and data is being sent to the object resource.
Action: None required.
• Shutdown of RMS has been aborted
When the user invokes hvshut -L, the hvshut utility asks for a confirmation. If the answer is no, then hvshut -L is aborted and this message is printed ou
308. [Figure 114: Invoking the log viewer]
You can search the logs based on any of the following:
• Resource name
• Date/time range
• Keyword filter
• Severity levels
• Exit codes
You can also search in the log display window by right-clicking on the displayed text. This brings up a Find pop-up window (Figure 115).
[Figure 115: Find pop-up window. Log viewer for app2.log on fuji2RMS, showing the Time Filter, Keyword Filter, Resource Name, Severity, and Non-zero exit code controls]
Detach the log by clicking on t
309. rce wizard 33
G
Gds resource wizard 33
GENERIC turnkey wizard 33, 80
Global Disk Services 33
global environment variables 27
Global Link Services 33
Gls resource wizard 33
graph, RMS 44
graphical user interface, See GUI
graphs 108
  application 110
  command pop-ups 114
  composite subapplication 112
  configuration information 113
  reinitializing 25
gResource 54
  object type 323
  required attributes 323
GUI 125
  after shut down 118
  messages 146
  pull-down menus 96
  starting RMS 97
H
Halt attribute 328
high availability 1, 9
  specifying applications 43
HostName attribute 336
hosts, site preparation 34
HV_AUTOSTART_WAIT 342
HV_AUTOSTARTUP_IGNORE 341
HV_CHECKSUM_INTERVAL 342
HV_CONNECT_TIMEOUT 345
HV_LOG_ACTION 346
HV_LOG_ACTION_THRESHOLD 343
HV_LOG_WARN_THRESHOLD 343
HV_MAX_HVDISP_FILE_SIZE 346
HV_MAXPROC 346
HV_RCSTART 346
HV_REALTIMEPRIORITY 346
HV_SCRIPTS_DEBUG 347
HV_SYSLOG_USE 347
HV_WAIT_CONFIG 344
hvcm -l 182
hvdisp command 27, 29, 144, 145
  file size 346
  no display 338
hvenv and hvenv.local files 27
  changing variables 28
hvexec command 53
hvreset, defined 25
hvshut, defined 25
hvshut command
  defining timeout 345
  stopping RMS 133
hvswitch, defined 25
hvswitch command, userApplication 135
hvswitch -f 163
hvutil 165
hvutil -c 163
hvutil command
  defined 25
  shutting down an application 138
hvw command 19
  defined 40
  operation mode 45
  resuming configuration 79
I
I_List attribute 328
IgnoreOfflineRequest a
310. re 89: RMS graph with affiliation names and resource names
5.2.5.8 Node status after RMS is shut down
After RMS is shut down, the RMS GUI windows become dark gray on the node from which they are getting their information (Figure 90). In this condition, all the states are white, indicating that the states are unknown. The main window and the clusterwide table continue to show the application states until RMS is shut down on all nodes.
[Figure 90: RMS graph after RMS is shut down]
5.2.6 RMS clusterwide table
The RMS clusterwide table displays the state information about userApplication objects as a summary table. The user can see the state of each of the userApplication objects on each of the system nodes. It presents the information in a concise manner.
Open the clusterwide table through a pop-up menu option for the cluster node (root node) in the RMS tree. The clusterwide table comes up in a sep
311. re that the state specified for hvassert is assertable.
• command bad timeout timeout
If the timeout specified for the hvassert command is not a number, this message is the result and the utility exits with exit code 1.
Action: Specify a number for the timeout value of hvassert.
• command cannot open file filename
hvsend is used to send messages to an object in a resource graph. It can get the list of messages to send from a file. If this file cannot be opened, this message is the result and the hvsend utility exits with exit code 8.
Action: Make sure that the file filename exists.
• command could not create a pipe
If the utility command could not open the tty to be written to, this message is the result and the utility exits with exit code 7.
Action: Contact field support.
• command failed due to undefined variable local_host
If the hvsend utility is unable to find the value of the environment variable RELIANT_HOSTNAME, this message is the result and it exits with exit code 7.
Action: Make sure that RELIANT_HOSTNAME is defined.
• command file already exists
When hvdisp -o has been invoked by the user and the output file that has been specified as an argument already exists, this message is the result and hvdisp exits with exit code 6.
Action: Specify a filename that does not already exist as the argument t
312. remote host must be skipped. Please start RMS manually on remote hosts.
This message is the result of starting RMS with the -a option when, due to some internal error, RMS could not be started on the remote host. Critical internal error.
Action: Contact field support. As a temporary workaround, try again or start RMS manually on each host.
• ADM 96 Remote startup of RMS failed <startupcommand>. Reason: errorreason
This message appears when RMS cannot be started on remote hosts because the command <startupcommand> failed.
Action: This may occur when some of the hosts are not reachable or the network is down.
• ADM 98 72 Dynamic modification failed: controller <controller> has its Resource attribute set to <resource>, but some of the controlled applications from this list do not exist
This message appears when the controller node was not able to find the applications it controls among the applications running on the host.
Action: Correct your modification file so that the controllers refer only to existing applications.
• ADM 99 73 Dynamic modification failed: cannot change attribute Resource of the controller object <controller> from <oldresource> to <newresource> because one or more of the applications listed in <newresource> is not an existing application or its state is incompatible
313. ribute.
The node <node> has a ScriptTimeout value that is less than its detector report time. This will cause a script timeout error to be reported before the detector can report the state of the resource. Increase the ScriptTimeout value for objectname (currently value seconds) to be greater than the detector cycle time (currently value seconds).
Node <node> has no detector while all its children's MonitorOnly attributes are set to 1.
The node <node> has both attributes LieOffline and ClusterExclusive set. These attributes are incompatible; only one of them may be used.
The type of object <object> cannot be "or" and "and" at the same time.
Object <object> is of type "and", its state is online, but not all children are online.
Action: Verify the above description and change the configuration appropriately.
• BAS 14 ERROR IN CONFIGURATION FILE: The object <object> belongs to more than one userApplication, userapplication1 and userapplication2. Objects must be children of one and only one userApplication object.
An object was encountered as a part of more than one user application. RMS applications cannot have common objects.
Action: Redesign your configuration so that no two applications have common objects.
• BAS 15 ERROR IN CONFIGURATION FILE: The object <object> is a l
314. rmation. These will be shipped to RMS support.
This message just indicates that the hvdump utility will now start collecting the information.
Action: None required.
• Dynamic modification is in progress, can't assert states
It is not possible to perform an hvassert when dynamic modification is in progress.
Action: Perform hvassert after dynamic modification finishes.
• Error becoming a real time process: errorreason
The RMS base monitor runs as a real-time process, thereby giving it higher priority over other processes on Solaris. If there is a problem in the base monitor becoming a real-time process due to errorreason, then this message is the result.
Action: Take action based on the reason.
• Error setting up real time parameters: errorreason
If there is a problem while setting up the parameters for the RMS base monitor to run as a real-time process, this message is the result along with the reason errorreason for the problem.
Action: Take action based on the reason.
• Error while starting up bm on the remote host <targethost>: errorreason
When hvcm is invoked with the -s option to start RMS on a remote host <targethost>, and there is a problem in starting up RMS on the remote host, this message is the result along with the reason for the problem <errorreason>.
Action: Take action based on the re
315. rnal error
Action: Contact field support.
• UAP 19 object SendUAppLockContract LOCK Contract cannot be sent
This message appears when the LOCK contract cannot be sent over the network.
Action: The network may be down.
• UAP 21 object SendUAppUnLockContract UNLOCK Contract cannot be sent
This message appears when the UNLOCK contract cannot be sent over the network.
Action: The network may be down.
• UAP 22 object unlock processing failed, cluster may be in an inconsistent condition
This message appears when the local node receives an UNLOCK contract but is unable to perform the follow-up processing that was committed in the contract.
Action: Contact field support.
• UAP 23 object failed to process UNLOCK contract
A host was unable to propagate the received UNLOCK contract, e.g. because of networking problems or memory problems.
Action: This message should appear with an additional ERROR message specifying the origin of the problem. Refer to the ERROR message.
• UAP 24 Deleting of local contractUAP object failed, cannot find object
This message appears when the local contract node has completed the contract and has sent it to the local node, but the local node was not able to find it.
Action: Contact field support.
• UAP 27 object received a DEACT contract in state state
The corresp
316. rollers that belong to this controlling application must also have their AutoSwitchOver attributes having the option Shutdown set as well.
Action: Provide correct settings for the AutoSwitchOver attributes.
• BM 81 93 Dynamic modification failed local controller attributes such as NullDetector or MonitorOnly cannot be modified during local modification (hvmod -l)
The reason for this message is that the modification of local controller attributes such as NullDetector or MonitorOnly is allowed only during global modification.
Action: Make a non-local modification, or modify different attributes.
• BM 90 94 Dynamic modification failed The length of object name <object> is length. This is greater than the maximum allowable length name of maxlength.
The length of the object name is greater than the maximum allowable length.
Action: Ensure that the length of the object name is smaller than maxlength.
• BM 92 95 Dynamic modification failed a non-empty value <value> is set to <ApplicationSequence> attribute of a non-scalable controller <controller>
A non-scalable controller cannot have its ApplicationSequence attribute set to a non-empty value.
Action: Provide correct settings for the ApplicationSequence and Scalable attributes.
• BM 94 97 Dynamic modification failed the ApplicationSequence attribute of a
317. ronment variable
wvSetparam: set Web-Based Admin View environment variable
wvstat: display the operating status of Web-Based Admin View
Glossary
Items in this glossary that apply to specific PRIMECLUSTER products are indicated with the following notation:
• (CF): Cluster Foundation
• (PCS): PRIMECLUSTER Configuration Services
• (RMS): Reliant Monitor Services
• (RCVM): Volume Manager (not available in all markets)
• (SIS): Scalable Internet Services
Some of these products may not be installed on your cluster. See your PRIMECLUSTER sales representative for more information.
AC
See Access Client.
Access Client
GFS kernel module on each node that communicates with the Meta Data Server and provides simultaneous access to a shared file system.
activating a configuration (RMS)
Preparing an RMS configuration to be run on a cluster. This involves two major actions: first, the configuration is generated on the host where the configuration was created or edited; second, the configuration is distributed to all nodes affected by the configuration. The user can activate a configuration using PCS, the RMS Wizards, or the CLI.
See also generating a configuration (RMS), distributing a configuration (RMS).
Administrative LAN
In PRIMECLUSTER configurations, an Administrative LAN is a private local area network (LAN) on which machines such as the System Console and Cluster Console reside. Because normal
318. ror messages
• ADM 52 16 Dynamic modification failed cannot link to parent <parentobject> since it will be deleted as a result of deletion of object <object>
If there is an attempt to delete an object <object> and use its descendants (which should be deleted as a result of deleting the parent) as the parent for a new resource that is being added to the RMS resource graph, this error message is printed and dynamic modification aborts.
Action: Do not attempt to delete an object and use its descendant as the parent for a new resource.
• ADM 53 26 Dynamic modification failed <node> is absent
Trying to modify the attribute of a node <node> that is absent leads to this error, and dynamic modification aborts.
Action: Modify the attributes of an existing node.
• ADM 54 27 Dynamic modification failed NODE <object>, attribute <attribute> is invalid
When RMS receives a directive to modify a node <object> with attribute <attribute> that has an invalid value, this message is the result and dynamic modification aborts.
Action: Specify a valid value for the attribute <attribute>.
• ADM 55 Cannot create admin queue
RMS uses UNIX queues internally for interprocess communication. The admin queue is one such queue; it is used for communication between RMS and other utilities like hvutil, hvmod, hvshut, hvswitch, and hvdisp. If RMS cannot create this queue due to some
319. [Figure 87: RMS graph with affiliation names]
Figure 88 shows a graph that displays resource names.
[Figure 88: RMS graph with resource names]
If both options are selected, graphs will display both the affiliation names and resource names. This combination stretches the graph horizontally and can make it difficult to read (Figure 89).
320. s
The Messages panel displays error and debug messages related to Cluster Admin. View these messages as follows:
> Select the msg tab on the bottom of the RMS tree panel. This tab turns red if a new message has been added to the text area since it was last viewed.
The message text area can be cleared or detached from the main panel.
6 Advanced RMS concepts
This chapter deals with ongoing RMS operations and provides information on RMS runtime behavior, particularly in cases where monitored components fail. This chapter discusses the following:
• The section "Internal organization" on page 147 briefly describes the object-oriented internal aspects of the RMS base monitor.
• The section "States and scripts" on page 150 lists the RMS scripts.
• The section "Initializing" on page 151 describes the process of transferring the control of nodes to RMS.
• The section "Online processing" on page 152 details the transition of a node to the Online state.
• The section "Offline processing" on page 157 details the transition of a node to the Offline state.
• The section "Fault processing" on page 159 explains how RMS handles fault situations.
• The section "Switch processing" on page 165 describes how RMS switches applications to other hosts in the cluster.
6.1 Internal organization
A brief description of the object-oriented internal aspects of the base monitor is use
321. s a number of notational, typographical, and syntactical conventions.
1.3.1 Notation
This manual uses the following notational conventions.
1.3.1.1 Prompts
Command line examples that require system administrator (or root) rights to execute are preceded by the system administrator prompt, the hash sign (#). Entries that do not require system administrator rights are preceded by a dollar sign ($).
In some examples, the notation node# indicates a root prompt on the specified node. For example, a command preceded by fuji2# would mean that the command was run as user root on the node named fuji2.
1.3.1.2 Manual page section numbers
References to operating system commands are followed by their manual page section numbers in parentheses, for example cp(1).
1.3.1.3 The keyboard
Keystrokes that represent nonprintable characters are displayed as key icons such as [Enter] or [F1]. For example, [Enter] means press the key labeled Enter; [Ctrl-b] means hold down the key labeled Ctrl or Control and then press the [B] key.
1.3.1.4 Typefaces
The following typefaces highlight specific elements in this manual.
Typeface         Usage
Constant Width   Computer output and program listings; commands, file names, manual page names, and other literal programming elements in the main body of text.
Italic           Variables in a command line that you must replace wi
322. s message is the result of the RMS base monitor being unable to extract a message from the ADMIN_Q that is used for communication between the utilities and RMS. RMS then exits with exit code 3.
Action: Contact field support.
• QUE 5 Network message read failed
If there is a problem reading a message over the network, this error is the result and RMS exits with exit code 3.
Action: System error. Contact field support.
• QUE 6 Network problem occurred
This message is the result of a network problem occurring when transferring messages.
Action: System error. Contact field support.
• QUE 11 Read message failed in DET_REP_Q
All the detectors use the queue DET_REP_Q to communicate with the RMS base monitor. If there is a problem in reading a message from the queue, RMS prints this message and exits with exit code 15.
Action: Contact field support.
• QUE 12 Error status in DET_REP_Q status
This message is the result of the RMS base monitor having a problem with the queue DET_REP_Q that is used by the different detectors to report their state. RMS then exits with exit code 15.
Action: Contact field support.
9.11 SCR: Scripts
• SCR 4 Failed to create a detector request queue for detector detectorname
If a detector request queue could not be created for detector detector_name, this message is the result and RMS exits with exit code
323. s someone explicitly deactivated it with the following command:
hvutil -d userApplication
5.3.7 Clearing a fault
Clear the fault for an application in the Faulted state as follows:
> Right-click on the application object and select the Clear Fault pop-up menu option (Figure 109).
[Figure 109: Clearing an application fault]
CLI
The syntax for the CLI is as follows:
hvutil -c userApplication
If the userApplication is in the online state, then clearing the fault will cause RMS to attempt to bring the faulted resource to the online state. If the userApplication is in the offline state, then clearing the fault wi
324. sages for program. For example, messages from bm (the base monitor) are recorded in bmlog.
The prefix for trace messages is as follows:
time file line
The prefix for error messages is as follows:
time file line ERROR
The switchlog file contains the following five message types:
• Informational messages (notices)
• Warning messages
• Error messages
• Fatal error messages
• Output from scripts run by RMS
The first four categories of messages all follow this format:
timestamp: error code, error number: message type: message: delimiter
There is a colon and a space between each field of the message, where:
• timestamp is defined as follows: yyyy-mm-dd hh:mm:ss.xxx
• message type is defined as one of the following: NOTICE, WARNING, ERROR, FATAL ERROR
• message is any text generated by the RMS product. This text can contain one or more new lines.
• delimiter is defined as a colon followed by a series of four equal signs (:====).
The last category of messages (output from scripts) follows no specific format and is merely the redirected standard output and standard error from all scripts defined within the RMS configuration file.
For example:
2001-05-07 11:01:54.568: WARNING: InitScript does not exist
7.8 System log
The base monitor of RMS writes messages to the switchlog file and also writes the same messages to the system log. By default, all the RMS
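The switchlog field layout described above lends itself to simple parsing. The Python sketch below is an illustration only: the exact punctuation of a switchlog line is an assumption reconstructed from the description (fields separated by a colon and space, a yyyy-mm-dd hh:mm:ss.xxx timestamp, an optional error code field, and a trailing delimiter of a colon plus four equal signs), and the names `SwitchlogEntry` and `parse_switchlog_line` are hypothetical. Script output, which has no fixed format, is simply reported as unparsed.

```python
import re
from typing import NamedTuple, Optional


class SwitchlogEntry(NamedTuple):
    timestamp: str
    code: Optional[str]      # error code and number field, if present
    message_type: str        # NOTICE, WARNING, ERROR or FATAL ERROR
    message: str


# Assumed layout: "timestamp: [code: ]TYPE: message[ :====]"
_LINE_RE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}):\s*"
    r"(?:(?P<code>[^:]*?):\s*)?"
    r"(?P<type>FATAL ERROR|NOTICE|WARNING|ERROR):\s*"
    r"(?P<message>.*?)\s*(?::\s*={4})?$",
    re.DOTALL,  # the message text may contain new lines
)


def parse_switchlog_line(line: str) -> Optional[SwitchlogEntry]:
    """Parse one switchlog record; return None for free-form script output."""
    match = _LINE_RE.match(line.strip())
    if match is None:
        return None
    return SwitchlogEntry(
        timestamp=match.group("timestamp"),
        code=match.group("code"),
        message_type=match.group("type"),
        message=match.group("message"),
    )


if __name__ == "__main__":
    print(parse_switchlog_line(
        "2001-05-07 11:01:54.568: WARNING: InitScript does not exist"))
```

Returning None for unmatched lines keeps redirected script output separate from the four structured message categories.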
325. scalable controller <controller>, therefore it cannot have its attribute <ControlledShutdown> set to 1 while its attribute <AutoSwitchOver> includes option <ShutDown>
An application controlled by a scalable controller cannot have ControlledShutdown set to 1 and AutoSwitchOver including the option <ShutDown> at the same time.
Action: Correct the RMS configuration.
• BM 111 104 Dynamic modification failed Line line is too big
A line in a configuration file is too big.
Action: Fix the RMS configuration so that each line takes less than 2000 bytes.
8.5 CML: Command line
• CML 11 Option option requires an operand
Certain options for hvcm require an argument. If hvcm has been invoked without the argument, this message appears along with the usage, and RMS exits with exit code 3.
Action: Check the hvcm man page for correct usage.
• CML 12 Unrecognized option option
The option provided is not a valid one.
Action: Check the hvcm man page for correct usage.
• CML 17 Incorrect range argument with -l option
The number for the -l option is not correct. Check the range.
Action: Check the man page for hvcm for the range argument with the -l option.
• CML 18 Log level <loglevel> is too large. The valid range is 1..maxloglevel with the -l option
If the loglevel loglevel specified with the -l option for hvcm is greater th
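The line length limit mentioned for BM 111 can be checked ahead of a dynamic modification. The following Python sketch is a hedged illustration: the function name `oversized_lines` is hypothetical, the 2000-byte limit is taken from the action text above, and the byte count assumes the file is read as UTF-8 text.

```python
from typing import List

MAX_LINE_BYTES = 2000  # each line must take less than this many bytes


def oversized_lines(text: str, limit: int = MAX_LINE_BYTES) -> List[int]:
    """Return the 1-based numbers of lines that are not shorter than limit bytes."""
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if len(line.encode("utf-8")) >= limit
    ]


if __name__ == "__main__":
    sample = "short line\n" + "x" * 2000 + "\n" + "y" * 1999
    print(oversized_lines(sample))
```

Any line number returned by this check identifies a line that should be split or shortened before the configuration is applied.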
326. scalable controller <controller> includes application name <hostname>, but this name is absent from the list of controlled applications set to the value of <resource> in the attribute <Resource>
The ApplicationSequence attribute of a scalable controller includes an application name absent from the list of the controlled applications.
Action: Provide correct settings for the ApplicationSequence and Resource attributes of the controller.
• BM 96 94 Dynamic modification failed a scalable controller <controller> has its attributes <Follow> set to 1 or <IndependentSwitch> set to 0
A scalable controller must have its attribute Follow set to 0 and <IndependentSwitch> set to 1.
Action: Provide correct settings for the Follow, IndependentSwitch, and Scalable attributes.
• BM 97 95 Dynamic modification failed controller <controller> attribute <ApplicationSequence> is set to <applicationsequence>, which refers to application(s) not present in the configuration
A scalable controller must list only existing applications in its ApplicationSequence attribute.
Action: Provide correct settings for attribute ApplicationSequence.
• BM 98 96 Dynamic modification failed two scalable controllers <controller1> and <controller2> control the same application <application>
O
327. sconfiguration
• WRP 32 RMS has received a message from host host with IP address receivedip. The local host has calculated the IP address of that host to be calcip. This may be due to a misconfiguration in /etc/hosts
The local host has received a message from host host with IP address receivedip, which is different from the locally calculated IP address for that host. This message will be printed in the switchlog for every 25 such messages that have been received, as long as the number of received messages is less than 500; if not, this message is printed for every 250th such message received.
Action: Check /etc/hosts for any misconfiguration.
• WRP 33 Error while creating a message queue with the key <id>, errno <errno>, explanation <explanation>
An abnormal OS condition occurred while creating a message queue.
Action: Check OS conditions that affect memory allocation for message queues, such as the size of swap space and the values of the parameters msgmax, msgmnb, msgmni, and msgtql. Check if the maximum number of message queues has already been allocated.
• WRP 34 Cluster host host is no longer in time sync with local node. Sane operation of RMS can no longer be guaranteed. Further out-of-sync messages will appear in the syslog
The time on host is not in sync with the time on the local node.
Action: Sync the time on host with the time on the local node.
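The throttling rule described for WRP 32 can be expressed in a few lines. This Python sketch only restates the rule in code and is not taken from RMS: the function name `should_log_mismatch` is hypothetical, and it assumes the count is a running 1-based total of mismatched messages and that "every 25" means exact multiples of 25.

```python
def should_log_mismatch(received_count: int) -> bool:
    """Decide whether the mismatch message is written to the switchlog.

    Below 500 received messages, every 25th message is logged;
    from 500 onward, only every 250th message is logged.
    """
    if received_count < 500:
        return received_count % 25 == 0
    return received_count % 250 == 0


if __name__ == "__main__":
    logged = [n for n in range(1, 1001) if should_log_mismatch(n)]
    print(logged)
```

The effect is that a persistent /etc/hosts mismatch is reported frequently at first and then progressively less often, so the switchlog is not flooded.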
328. sed on a keyword.
[Figure 119: Results of keyword-based search]
7.4.4 Search based on severity levels
Search the log files based on severity levels as follows:
1. Click on the Severity button.
2. Choose one of the severity levels as described in Table 8.
3. Click on the Filter button.
Severity level   Description
Emergency        Systems cannot be used
Alert            Immediate action is necessary
Critical         Critical condition
Error            Error condition
Warning          Warning condition
Notice           Normal but important condition
Info             For information
Debug            Debug messages
Table 8: Descriptions of severity levels
329. serApplication tn N string resource L 0 1 resource o u SysNode 1 level w Ww 1 all userApplication r m on off forceoff userApplication M on off forceoff
This message could appear in any one of the following situations:
• hvutil -u is invoked with more than 1 argument. Exit code 7.
• hvutil is invoked without any options or arguments. Exit code 7.
• hvutil is invoked with an illegal option. Exit code 7.
• hvutil -i is used without an argument. Exit code 13.
• hvutil -r is used with an argument. Exit code 14.
• hvutil -w or -W is used with an argument. Exit code 9.
• hvutil -n is invoked with NoConfirm as the only argument. Exit code 5.
• hvutil -m or -M is invoked with an argument other than on, off, or forceoff. Exit code 16.
• hvutil -m is invoked without an argument, or hvutil -M is invoked with an argument. Exit code 16.
Action: Follow the intended usage of hvutil.
11 Appendix: Operating system error numbers
Some RMS error messages display the operating system error number <errno> that was returned when a process such as a detector or script failed. These error numbers may provide important clues in diagnosing the problem. See user document or header files provided with the re
330. Menu items
The Main configuration menu can perform the following activities when RMS is not running anywhere in the cluster:
• Application-Create: Specifies which application to configure for high availability. In addition, this operation specifies all the relevant settings for the application so that it can run in a high availability configuration monitored by RMS. Among the most important of these settings are the name of the application and the list of nodes on which the application may run.
The user application should be configured to run on multiple nodes for a high availability configuration.
The wizard assists you by supplying menus with basic and non-basic attributes, assigns values to the attributes, and prompts you if an attribute is mandatory.
By choosing the appropriate turnkey wizard for an application, the wizard will then provide predefined elements, like scripts and detectors, for the application in question. These elements have been developed especially for the respective type of application.
The wizard will also carry out consistency checks at certain stages of the configuration procedure in order to prevent inconsistent applications from running in a high availability configuration.
• Application-Edit: Modifies an existing application.
An existing application can be modified using this menu item. The following modes are available for ed
331. source
• ADM 27 2 Dynamic modification failed child object <childobject> is absent
Any attempt to link to a child object <childobject> that is non-existent leads to this message, and dynamic modification aborts.
Action: Make sure that the child object to be linked to exists.
• ADM 28 19 Dynamic modification failed child object <childobject> is not a resource
When a new object <childobject> being added to an existing configuration is not a resource, this message is the result and dynamic modification aborts.
Action: Make sure that the child object specified is a resource.
• ADM 29 3 Dynamic modification failed parent object <parentobject> is absent
Action: Critical error. Contact field support.
• ADM 30 20 Dynamic modification failed parent object <parentobject> is not a resource
During dynamic modification, if there is a request to add a new parent object <parentobject> that is not a resource, this message occurs and dynamic modification aborts.
Action: Make sure that the object being added as a parent object is a resource.
• ADM 31 4 Dynamic modification failed child object <childobject> is absent
As part of dynamic modification, if the specified child object <childobject> does not exist, then this message is the result and dynami
332. sources used to develop or interpret the configuration file.
See also configuration file (RMS).
template
See application template (RMS).
type
See object type (RMS).
UP (CF)
A node state that indicates that the node can communicate with other nodes in the cluster.
See also DOWN (CF), LEFTCLUSTER (CF), node state (CF).
virtual disk
With virtual disks, a pseudo device driver is inserted between the highest level of the OS logical Input/Output (I/O) system and the physical device driver. This pseudo device driver then maps all logical I/O requests on physical disks.
See also concatenated virtual disk (RCVM), mirror virtual disk (RCVM), simple virtual disk (RCVM), striped virtual disk (RCVM).
Web-Based Admin View
A Java-based, OS-independent interface to PRIMECLUSTER management components.
See also Cluster Admin.
wizard (RMS)
An interactive software tool that creates a specific type of application using pretested object definitions. An enabler is a type of wizard.
Wizard Kit (RMS)
See RMS Wizard Kit.
Wizard Tools (RMS)
See RMS Wizard Tools.
Abbreviations
AC: Access Client
API: application program interface
bm: base monitor
CCBR: Cluster Configuration Backup/Restore
CDL: Configuration Definition Language
CF: Cluster Foundation or Cluster Framework
CIM: Cluster Integrity Monitor
CIP
333. Make sure dynamic modification does not contain delete sysnode, where sysnode is the name of the local node.
• ADM 84 64 Dynamic modification failed cannot add SysNode <sysnode> since its name is not valid
This message appears in the switchlog if the name <sysnode> specified as part of the dynamic modification is not resolvable to any known host name.
Action: Specify a host name that is resolvable to a network address.
• ADM 85 65 Dynamic modification failed timeout expired, timeout symbol is <symbol>
If the dynamic modification takes too much time, this message is the result.
Action: Make sure that the network connection between the hosts is functional. Also verify that the scripts from newly added resources do not take too much time to execute, that dynamic modification does not add too many new nodes, and that the modification file is not too big or too complex.
• ADM 86 66 Dynamic modification failed application <appname> cannot be deleted since it is controlled by the controller <controller>
A controlled application <appname> cannot be deleted while its controller <controller> retains the application's name in its Resource attribute.
Action: Remove the name of the deleted application from the controller's Resource attribute, or add a new application with the same name, or delete the controller together with
334. • WRP 35 Cluster host host is no longer in time sync with local node. Sane operation of RMS can no longer be guaranteed
The time on the cluster host host differs significantly (> 25 seconds) from the local node.
Action: Make sure that all the cluster hosts are in time sync.
• WRP 42 The interconnect <interconnect> to cluster host <host> has failed
The interconnect interconnect to host host has failed.
Action: Fix the interconnect.
9 Fatal error messages
This chapter contains a detailed list of all fatal RMS error messages that appear in the switchlog. Most messages are accompanied by a description of the probable cause(s) and a suggested action to correct the problem. In some cases, the description or action is self-evident and no further information is necessary.
Some messages in the listings that follow contain words printed in italics. These words are placeholders for values, names, or strings that will be inserted in the actual message when the error occurs.
RMS error code description
A prefix in each message contains an error code and message number identifying the RMS component that detected the problem. You may need to provide this prefix to support engineers who are diagnosing your problem. The following list summarizes the possible error codes and the
335. startup processes are running, but hvdisp hangs.
This problem might occur if the local node is in the CF state LEFTCLUSTER from the point of view of the other (at least some other) cluster nodes.
Action: Verify the problem by calling cftool -n on all cluster nodes to check for a possible LEFTCLUSTER state. Call cftool -k to clear the LEFTCLUSTER state. RMS will continue to run as soon as the node has joined the cluster. No restart should be necessary.
• RMS loops (or even dies) shortly after being started.
This problem could occur if the CIP configuration file /etc/cip.cf contains entries for the netmask. These entries are useless (not evaluated by CIP). From the RMS point of view, these entries cannot be distinguished from IP addresses, which have the same format, so RMS will invoke a gethostbyaddr(). This normally does no harm, but in some unusual cases the OS may become confused.
Action: Verify the problem by checking if netmask entries are present in /etc/cip.cf. Remove the netmask entries and restart RMS.
• RMS detects a node failure (network connection failed to host), but does not even attempt to kill the node.
This problem could occur if the failed node was already in a pending Wait state from an earlier failed kill request. If a kill request fails, the SysNode remains in the Wait state until this state is manually cleared by the System Admin
336. ster Foundation functions that the PRIMECLUSTER services use in the layer above.
See also Cluster Foundation (CF).
base monitor (RMS)
The RMS module that maintains the availability of resources. The base monitor is supported by daemons and detectors. Each node being monitored has its own copy of the base monitor.
Cache Fusion
The improved interprocess communication interface in Oracle 9i that allows logical disk blocks (buffers) to be cached in the local memory of each node. Thus, instead of having to flush a block to disk when an update is required, the block can be copied to another node by passing a message on the interconnect, thereby removing the physical I/O overhead.
CCBR
See Cluster Configuration Backup and Restore.
CF
See Cluster Foundation (CF).
child (RMS)
A resource defined in the configuration file that has at least one parent. A child can have multiple parents, and can either have children itself (making it also a parent) or no children (making it a leaf object).
See also resource (RMS), object (RMS), parent (RMS).
cluster
A set of computers that work together as a single computing source. Specifically, a cluster performs a distributed form of parallel computing.
See also RMS configuration.
Cluster Admin
A Java-based, OS-independent management tool for PRIMECLUSTER products such as CF, RMS, PCS, and SIS. Cluster Admin is available from the Web-Based Admin View inter
337. ster host
• LEFTCLUSTER event occurs
When any of these events happen, RMS must first ensure that the host with which contact was lost is down before automatic switchover occurs. To accomplish this, RMS uses the Shutdown Facility (SF). For more information about the Shutdown Facility and shutdown agents, see the Cluster Foundation (CF) Configuration and Administration Guide.
Once the shutdown of the cluster host is verified by the SF, all userApplication nodes that were Online on the affected cluster host are priority switched to surviving cluster hosts. Descriptions of the shutdown methods are provided in the sections that follow.
6.6.5.1 Operator intervention
If SF fails, then operator intervention is required. The indication that operator intervention is required is the persistent Wait state of any SysNode in the cluster. In this instance, a persistent Wait state is defined as a SysNode Wait state that lasts longer than the SCON reply time added to the script timeout for the ShutdownScript.
The value of the SCON reply time can be found by executing the following:
/opt/SMAW/SMAWRrms/bin/hvenv | grep HV_SCON_REPLY_TIME
The value of the script timeout for the ShutdownScript can be found by executing the following:
/opt/SMAW/SMAWRrms/bin/hvdisp SysNode_name | grep ScriptTimeout
Alternately, the administrator can look for a message in the switchlog indicatin
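The definition of a persistent Wait state given above is a simple threshold comparison. The Python sketch below illustrates it under stated assumptions: the function name `is_persistent_wait` is hypothetical, all three values are in seconds, and the SCON reply time and ShutdownScript timeout are assumed to have been read beforehand from the hvenv and hvdisp output.

```python
def is_persistent_wait(wait_seconds: float,
                       scon_reply_time: float,
                       shutdown_script_timeout: float) -> bool:
    """Return True if a SysNode has been in the Wait state longer than
    the SCON reply time plus the ShutdownScript timeout."""
    return wait_seconds > scon_reply_time + shutdown_script_timeout


if __name__ == "__main__":
    # Example values only; real values come from the cluster itself.
    print(is_persistent_wait(400, scon_reply_time=60, shutdown_script_timeout=300))
```

A True result corresponds to the situation in which the Shutdown Facility has failed and an operator needs to step in.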
338. striction.
Note the state of the controller objects in Figure 5 and Figure 6. For each Scalable mode child application, an instance of its controller is online on every node where that application can run. This architecture allows RMS to efficiently monitor the cluster resources available to each child application, regardless of where the application is running at the time.
2.2.3.3 Further notes about controllers
The Follow and Scalable modes are mutually exclusive: a controller for a child application can operate in either Follow mode or Scalable mode, but not both. The Wizard Tools ensure that each controller's configuration is self-consistent.
However, a parent application can have more than one child application. Since each child has its own controller in the parent, each can operate in a different mode. For example, suppose the teller application in the banking scenario also has an ATM controlled application. The database could be configured to operate in Follow mode, while the ATM application could be configured to operate in Scalable mode.
2.3 How the Wizard Tools provide easy configuration
RMS is a mature product with many features and options. Experts who develop, debug, and fine-tune complete RMS configurations must know how RMS works and what RMS needs in order to function properly. For each application in the configuration, the expert must
339. system administrators, or in cases where a browser is not available. The following sections primarily describe the Cluster Admin GUI options. The CLI equivalents are provided in the RMS procedures section.
5.2 Using Cluster Admin
The following sections discuss how to use the RMS portion of the GUI. Windows desktop systems require the Java plug-in as specified in the Web-Based Admin View Operation Guide.
5.2.1 Starting Cluster Admin
Open the Java-enabled browser (use Internet Explorer 5.x or Netscape Navigator 4.x, or higher versions) and enter the following URL in the Address location:
http://hostname:8081/Plugin.cgi
The hostname should be the name or IP address of the primary or secondary management server. For example, if a cluster named FUJI has fuji2 and fuji3 as its primary and secondary management servers, the URL would be either one of the following:
• http://fuji2:8081/Plugin.cgi
• http://fuji3:8081/Plugin.cgi
After contacting the host, the browser changes the URL suffix from .cgi to .html. Figure 64 shows an example of the Cluster Admin opening screen. For details on the primary and secondary management servers, please refer to the Web-Based Admin View Operation Guide.
340. t Invalid attribute is specified for modification.
Action: Modify only valid attributes.
• BM 20 77 Dynamic modification failed line linenumber cannot build object <object> because its type <symbol> is not a user type
An object <object> of a system type <symbol> is specified during dynamic modification.
Action: Use only valid resource types when adding new objects to the configuration.
• BM 21 78 Dynamic modification failed cannot delete object <object> because its type <symbol> is not a user type
An object <object> of a system type <symbol> is specified for deletion.
Action: Delete only objects that are valid resource types.
• BM 23 80 Dynamic modification failed The <Follow> attribute for controller <controller> is set to 1 but the content of a PriorityList of the controlled application <controlleduserapplication> is different from the content of the PriorityList of the application <appname> to which <controller> belongs
This message appears when the PriorityList of the controlled application <controlleduserapplication> is different from the content of the PriorityList of the application <appname> to which the controller <controller> belongs.
Action: Make sure that the PriorityList of the controller and the controlled application is sa
341. t
Action: None required.
• Starting Reliant Monitor Services now
When RMS is starting up, this message is printed out.
Action: None required.
• Starting RMS on remote host host now
This message is printed when RMS is being started on the remote host host.
Action: None required.
• startup aborted per user request
When RMS is started with the -c option and the specified configuration file is different from the entry in CONFIG.rms, RMS asks the user to confirm activation of the different configuration file. If the response is no, this message is the result.
Action: None required.
• The command command could not be executed
The execution of the command command failed.
Action: Check to see if the command is available.
• The command command failed to reset uid information with errno errno errorreason
The execution of the command command failed while trying to reset the effective uid.
Action: Depends on the errno value.
• The configuration file nondefaultconfig has been specified as option argument of the -c option but the Wizard Tools activated configuration is defaultconfig see defaultconfig The base monitor will not be started The desired configuration file should be re-activated using the Wizard Tools hvw command
This message is shown when th
342. t this message is the result.
Action: Do not attempt to redefine the DetectorStartScript when the detector is already running.
• BAS 9 ERROR IN CONFIGURATION FILE message
The message can be any one of the following:
- Check for SanityCheckErrorPrint
- Object <object> cannot have its HostName attribute set since it is not a child of any userApplication. Only the direct descendants of userApplication can have the HostName attribute set.
- In basic C parentsCount The node <node> belongs to more than one userApplication, app1 and app2. Nodes must be children of one and only one userApplication node.
- The node <node> is a leaf node and this type <type> does not have a detector. Leaf nodes must have detectors.
- The node <node> has an empty DeviceName attribute. This node uses a detector and therefore it needs a valid DeviceName attribute.
- The rName is <rname>, its length length is larger than max length maxlength.
- The DuplicateLineInHvgdstartup is <number>, so the hvgdstartup file has a duplicate line.
- The NoKindSpecifiedForGdet is <number>, so no kind specified in hvgdstartup.
- Failed to load a detector of kind <kind>.
- The node <node> has an invalid rKind attribute. Nodes of type gResource must have a valid rKind att
343. Figure 69 RMS main window
5.2.4.1 RMS tree
The RMS tree displays the configuration information of the cluster in a hierarchical format. The tree has the following levels:
• Root of the tree: Represents the cluster.
• First level: Represents the system nodes forming the cluster.
• Second level: Represents the userApplication objects running on each of the system nodes.
• Third level: Represents subapplications, if any.
• Fourth level: Represents the resources necessary for each of the subapplications.
If an application has subapplications, the fourth level represents resources used by that subapplication. If an application does not have subapplications, then the third level represents all the resources used by the userApplication. Dependencies between the applications are depicted in the RMS tree by means of the controller object. An example of the RMS tree with a controller object is shown in Figure 70.
344. t already exist. The directory will be used solely for NFS Lock Failover. Therefore, if you specify a directory that already exists, no other applications will be allowed to use it thereafter.
From the RMS Main configuration menu, select Configuration-Edit-Global-Settings. In the Global Settings menu, select menu item NFSLockFailover (see Figure 8). The directory entered in this screen will be created on all shared file systems selected for NFS Lock Failover. For example, if the directory nfs_lock_dir is entered in this screen and the file system /usr/test1 in userApplication APP1 is selected for NFS Lock Failover, then a directory /usr/test1/nfs_lock_dir will be created if it does not already exist and will be used for storing lock information. Only one file system per userApplication object can be selected for NFS Lock Failover. For a more detailed description, refer to the HTML documentation for the Fsystem wizard.
Figure 8 NFS Lock Failover screen
• The directory entered in this screen must be accessible to all the nodes in the cluster. Otherwise, NFS failover will not work.
• This directory is reserved for NFS Lock Failover only. This directory must not be used by any other applications.
• If
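As a rough illustration of how the lock directory path is derived from the selected file system and the name entered in the screen, here is a minimal sketch. The function name is hypothetical, and the handling of a leading slash follows the manual's note that such a character is dropped before the directory is created; nothing here is an RMS interface.

```python
import posixpath


def nfs_lock_path(file_system: str, entered_dir: str) -> str:
    """Build the lock directory path under a shared file system.

    A single leading slash in the entered name is dropped, so the
    directory always ends up beneath the selected file system.
    """
    relative = entered_dir[1:] if entered_dir.startswith("/") else entered_dir
    return posixpath.join(file_system, relative)


if __name__ == "__main__":
    # Values taken from the example in the text above.
    print(nfs_lock_path("/usr/test1", "nfs_lock_dir"))
    print(nfs_lock_path("/usr/test1", "/nfs_lock_dir"))
```

Both calls yield the same path, which is the point of the slash rule: the entered name is always interpreted relative to the shared file system rather than as an absolute path.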
345. t are running in a cluster.
• BM 59 Error errno while reading line <linenumber> of dob file <errorreason>
During dynamic modification, the base monitor reads its configuration from a dob file. When this file cannot be read, this message appears in the switchlog. The specific OS error is indicated in errno and errorreason.
Action: Make sure the host conditions are such that the dob file can be read without errors.
• BM 68 Cannot get message queue parameters using sysdef errno <errno> reason <reason>
While obtaining message queue parameters, sysdef was not able to communicate them back to the base monitor. The values of errno and reason indicate the kind of error.
Action: Contact field support.
• BM 71 90 Dynamic modification failed Controller <controller> has its attribute Follow set to 1 Therefore its attribute IndependentSwitch must be set to 0 and its controlled application <application> must have attributes AutoSwitchOver No ControlledSwitch 1 and ControlledShutdown 1 However the real values are IndependentSwitch <isw> AutoSwitchOver <asw> ControlledSwitch <csw> and ControlledShutdown <css>
When the controller's attribute Follow is set, other attributes such as IndependentSwitch, AutoSwitchOver, ControlledSwitch, and ControlledShutdown mus
346. t configuration
Starting RMS from the main menu
RMS Start Menu for all nodes
RMS Start Menu for individual nodes
Starting RMS on individual nodes
Stopping RMS
Figure 102 Stopping RMS on all available nodes
Figure 103 Stopping RMS on one node from the list
Figure 104 Using command pop-up to stop RMS
Figure 105 Starting an application
Figure 106 Switching an application
Figure 107 Switching a busy application
Figure 108 Shutting down an application
Figure 109 Clearing an application fault
Figure 110 Clusterwide environment variables
Figure 111 Local environmental variables pop-up
Figure 112 Local environmental variables window
Figure 113 Displaying application states
Figure 114 Invoking the log viewer
Figure 115 Find pop-up window
Figure 116 Detached log
Figure 117 Resource based search
Figure 118 Results of time based search
347. t exists but cannot be executed.
Action: Make InitScript executable.
• INI 7 sysnode must be in your configuration file
If the local SysNode sysnode is not part of the configuration file, this message is the result and RMS exits with exit code 23.
Action: Make sure that the local SysNode sysnode is part of the configuration file.
• INI 10 InitScript has not completed within the allocated time period of timeout seconds
InitScript was still running when the time period allocated for its execution expired. The timeout period is the lesser of the value defined in the environment variable SCRIPTS_TIME_OUT in the hvenv file and 300.
Action: Increase the timeout value, or correct the conditions that lead to the timeout during script execution.
• INI 11 InitScript failed to start up errno errno reason reason
An error occurred during startup of InitScript. The errno code <errno> and reason <reason> are presented in the message.
Action: Correct the erroneous host condition so that InitScript can start up.
• INI 12 InitScript returned non zero exit code exitcode
InitScript completed with a non-zero exit code <exitcode>.
Action: Correct the erroneous host condition so that InitScript can return a zero exit code, or fix the InitScript itself.
• INI 13 InitScript has been stopped
InitScript has been stopped
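The timeout rule stated under INI 10 can be sketched in a few lines. This is only an illustration of the "least of SCRIPTS_TIME_OUT or 300" wording: the function name is hypothetical, the environment is modeled as a plain dictionary, and falling back to 300 when the variable is unset is an assumption rather than documented behavior.

```python
INIT_SCRIPT_CAP_SECONDS = 300  # upper bound named in the INI 10 description


def init_script_timeout(hvenv: dict) -> int:
    """Return the effective InitScript timeout in seconds.

    Takes the smaller of SCRIPTS_TIME_OUT and 300. If the variable is
    missing, this sketch assumes the 300-second cap applies.
    """
    raw = hvenv.get("SCRIPTS_TIME_OUT")
    if raw is None:
        return INIT_SCRIPT_CAP_SECONDS
    return min(int(raw), INIT_SCRIPT_CAP_SECONDS)


if __name__ == "__main__":
    print(init_script_timeout({"SCRIPTS_TIME_OUT": "120"}))
    print(init_script_timeout({"SCRIPTS_TIME_OUT": "900"}))
```

One consequence worth noting: raising SCRIPTS_TIME_OUT above 300 has no effect on InitScript under this rule, so the "increase the timeout value" action only helps when the variable is currently set below the cap.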
348. t have the values 0, No, 1, and 1, respectively. However, this condition is violated in the configuration file.
Action: Supply a valid combination of attributes for the controller and its controlled user application.
• BM 72 91 Dynamic modification failed Controller <controller> with the <Follow> attribute set to 1 belongs to an application <application> which PersistentFault is <appfault> while its controlled application <controlledapplication> has its PersistentFault <_fault>
If a controller has Follow set to 1, then all its controlled applications must have the same value for the attribute PersistentFault as the application to which the controller belongs.
Action: Check and correct the configuration.
• BM 73 The RMS CF interface is inconsistent and will require operator intervention The routine routine failed with error code errorcode errorreason
This is a generic message indicating that the execution of the routine routine failed due to the reason errorreason, and hence the RMS CF interface is inconsistent. Depending on which routine routine has failed, the base monitor can exit with any one of the exit codes 132, 133, 134, 135, 136, 137, 138, or 95.
Action: Contact field support.
• BM 74 The attribute DetectorStartScript and hvgdstartup file cannot be used together
The hvgdstartup file is
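The attribute constraints behind BM 71 and BM 72 amount to a small consistency check, sketched below. This is not how the base monitor implements it; the function name and the dictionary representation of objects are assumptions for illustration, while the attribute names and required values are taken from the message text.

```python
def follow_violations(controller: dict, controlled_app: dict, parent_app: dict) -> list:
    """List the attributes that break the Follow-mode rules.

    Rules (from the BM 71 and BM 72 messages): with Follow set to 1,
    the controller needs IndependentSwitch 0, the controlled application
    needs AutoSwitchOver No, ControlledSwitch 1, ControlledShutdown 1,
    and PersistentFault must match the controller's own application.
    """
    if controller.get("Follow") != 1:
        return []  # the rules only apply in Follow mode

    problems = []
    if controller.get("IndependentSwitch") != 0:
        problems.append("IndependentSwitch")

    required = {"AutoSwitchOver": "No", "ControlledSwitch": 1, "ControlledShutdown": 1}
    for name, wanted in required.items():
        if controlled_app.get(name) != wanted:
            problems.append(name)

    if controlled_app.get("PersistentFault") != parent_app.get("PersistentFault"):
        problems.append("PersistentFault")
    return problems


if __name__ == "__main__":
    ctl = {"Follow": 1, "IndependentSwitch": 0}
    child = {"AutoSwitchOver": "No", "ControlledSwitch": 1,
             "ControlledShutdown": 0, "PersistentFault": 0}
    parent = {"PersistentFault": 1}
    print(follow_violations(ctl, child, parent))
```

Returning the full list of offending attributes, rather than stopping at the first, mirrors the way the BM 71 message reports all of the real values at once.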
349. <object> that is going to be linked or unlinked will be either deleted or unlinked from all applications
Any attempt to delete an object <object> from the RMS resource graph and then unlink it from its parent object, or vice versa, results in dynamic modification being aborted and the above message being printed to the switchlog.
Action: Make sure that the operations of deletion and unlinking are not performed on an object at the same time.
• ADM 25 1 Dynamic modification failed parent object <parentobject> is absent
When a new object is being added to an existing configuration, it should have an existing object <parentobject> as its parent; if not, dynamic modification is aborted and the message is printed to the switchlog.
Action: Make sure that the parent specified for a new object that is being added exists.
• ADM 26 18 Dynamic modification failed parent object <parentobject> is neither a resource nor an application
When a new object is being added to an existing configuration and the specified parent object <parentobject> is neither a resource nor an application, dynamic modification is aborted and the message is printed.
Action: Make sure that the parent object specified for a new object is a re
350. t_node can exit with any one of the exit codes 132, 133, 134, 135, 136, 137, 138, or 95.
Action: Contact field support.
• NOD 30 detector message get doesn't work in hvdet_node
When hvdet_node contacts the base monitor to get the list of SysNodes and cannot get the list even after trying 10 times, this message is printed to the switchlog. This means that there is some problem with the message queues between hvdet_node and the base monitor. RMS then exits with exit code 129.
Action: Contact field support.
• NOD 31 detector nodename nodename not in NODELIST
This message indicates a severe malfunction in RMS, when the detector <detector> cannot find the node <nodename> in its list of nodes.
Action: Contact field support.
• NOD 33 The interface interface connection to the cluster host host failed
Action: Critical error. Contact field support.
• NOD 34 detector Failed to call osd select errorreason
If the detector hvdet_node fails during the system call select while reading messages, this message is printed to the switchlog along with the reason <errorreason>. The detector then exits with exit code 131.
Action: Contact field support.
• NOD 37 Child hvdet_node died Will try to restart hvdet_node
Action: None required.
351. tchlog file in a detached window.
Display the application log by right-clicking on an online application on the RMS tree and choosing View Logfile (Figure 78).
Figure 78 Viewing the application log
By default, the entire log is available in the scrolled area at the bottom of the window. You can restrict the entries displayed with the following filters:
• Timestamp: Click the Enable check box and select the period of interest.
• Resource name, severity of error messages, non-zero exit code, or keyword.
Selected and non-blank criteria are combined with a logical AND. Refer to the RMS Troubleshooting Guide for a complete description of severity levels and exit codes. Click the Filter button to display the filtered log entries. Figure 79 shows the screen for a search based on the dat
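The way the log viewer combines its filter criteria can be sketched as follows. This is an illustration of the "selected and non-blank criteria are combined with a logical AND" rule only; the LogEntry fields and the filter_entries function are hypothetical and do not reflect the GUI's internal data model.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LogEntry:
    resource: str
    severity: str
    exit_code: int
    text: str


def filter_entries(entries: List[LogEntry],
                   resource: Optional[str] = None,
                   severity: Optional[str] = None,
                   nonzero_exit: bool = False,
                   keyword: str = "") -> List[LogEntry]:
    """Keep entries that satisfy every criterion that was actually set."""
    def matches(entry: LogEntry) -> bool:
        checks = []
        if resource:
            checks.append(entry.resource == resource)
        if severity:
            checks.append(entry.severity == severity)
        if nonzero_exit:
            checks.append(entry.exit_code != 0)
        if keyword:
            checks.append(keyword in entry.text)
        # all() of an empty list is True, so no criteria means no filtering
        return all(checks)

    return [entry for entry in entries if matches(entry)]


if __name__ == "__main__":
    log = [
        LogEntry("app2", "NOTICE", 0, "Processing prechecks for application app2"),
        LogEntry("app2", "ERROR", 1, "script failed"),
    ]
    for entry in filter_entries(log, severity="ERROR", nonzero_exit=True):
        print(entry.text)
```

Blank or unselected criteria are skipped rather than treated as a match against an empty value, which is what lets a single keyword search work without also choosing a resource or severity.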
352. tectors. For information on the availability of the RMS Wizard Kit, contact your local customer support service or refer to the RMS Wizards documentation package.
2.5 Cluster Admin
The Cluster Admin GUI is the primary administrative tool for RMS. For RMS, it allows users full access to the application control functions of RMS, including the following:
• Application startup
• Application shutdown
• Manual application switchover
• Visual cues for resource and application fault isolation
• Fault clearing capability
• RMS startup
• RMS shutdown
• Graphs of application and resources
2.6 RMS components
The RMS product is made up of the following software components that run on each node in the cluster:
• Base monitor
• Detectors
• Scripts
• RMS CLI
2.6.1 Base monitor
The base monitor process is the decision-making segment of the RMS process group. It has the following functions:
• Stores the current configuration of resources as depicted by objects, their attributes, and their interdependent relationships
• Receives requests from the RMS command line interface (CLI) to take actions
• Receives input from detectors that report state changes
• Launches scripts to bring applications and their dependent resources Online or Offline
• Dictates the sequencing of the resource state changes to ensure res
353. ter configuration formation on a PRIMECLUSTER node
15.2 CF
System administration
cfconfig: configure or unconfigure a node for a PRIMECLUSTER cluster
cfset: apply or modify /etc/default/cluster.config entries into the CF module
cftool: print node communications status for a node or the cluster
15.3 CFS
fsck_rcfs: file system consistency check and interactive repair
mount_rcfs: mount RCFS file systems
rcfs_fumount: force unmount RCFS mounted file system
rcfs_list: list status of RCFS mounted file systems
rcfs_switch: manual switchover or failover of a RCFS file system
ngadmin: node group administration utility
cfsmntd: cfs mount daemon for RCFS
15.4 CIP
System administration
cipconfig: start or stop CIP 2.0
ciptool: retrieve CIP information about local and remote nodes in the cluster
File format
cip.cf: CIP configuration file format
15.5 Monitoring Agent
System administration
clrcimonctl: start, stop, or restart the RCI monitoring agent daemon, and display daemon presence
clrccumonctl: start, stop, or restart the console monitoring agent daemon, and display daemon presence
clrccusetup: register, change, delete, or display console information
15.6 PAS
System administration
mipcstat: MIPC statistics
clmstat: CLM statist
354. ternal data structure. If the data structure has run out of slots (16) to put the SysNode name in, this message is printed out.
Action: Contact field support.
• WRP 15 gethostbyname hostname host name should be in /etc/hosts
When the hostname hostname specified as a SysNode does not have an entry in /etc/hosts, this message is printed to the switchlog.
Action: Correct the host name hostname to be an entry in /etc/hosts.
• WRP 16 No available slot for host hostname
When RMS has run out of slots for the cluster interfaces (64), this message is printed along with the host name hostname for which this happened.
Action: Contact field support.
• WRP 17 Size of integer or IP address is not 4 bytes
Critical internal error.
Action: Contact field support.
• WRP 18 Not enough memory in <processinfo>
Action: Critical error. Contact field support.
• WRP 23 The child process <cmd> with pid <pid> could not be killed due to errno <errno> reason reason
The child process with pid pid could not be killed due to reason reason.
Action: Take action based on the reason reason.
• WRP 24 Unknown flag option set for killChild
The killChild routine accepts one of two flags, KILL_CHILD and DONTKILL_CHILD. If an option other than these two has been specified, this message is the result.
Action: Please contact
355. th an actual value. May be enclosed in angle brackets to emphasize the difference from adjacent text, e.g., <nodename>RMS; unless directed otherwise, you should not enter the angle brackets. The name of an item in a character-based or graphical user interface. This may refer to a menu item, a radio button, a checkbox, a text input box, a panel, or a window title. Bold: Items in a command line that you must type exactly as shown. Typeface conventions are shown in the following examples.
1.3.1.5 Example 1
Several entries from an /etc/passwd file are shown below:
root:x:0:1:0000-Admin(0000):/:/sbin/ksh
sysadm:x:0:0:System Admin:/usr/admin:/usr/sbin/sysadm
setup:x:0:0:System Setup:/usr/admin:/usr/sbin/setup
daemon:x:1:1:0000-Admin(0000):/:
1.3.1.6 Example 2
To use the cat(1) command to display the contents of a file, enter the following command line:
cat file
1.3.2 Command syntax
The command syntax observes the following conventions:
Symbol   Name           Meaning
[ ]      Brackets       Enclose an optional item.
{ }      Braces         Enclose two or more items of which only one is used. The items are separated from each other by a vertical bar (|).
|        Vertical bar   When enclosed in braces, it separates items of which only one is used. When not enclosed in braces, it is a literal element indicating that the output of one program is piped to the input of another
356. that timeout is large enough to execute the script.
• WLT 3 Cluster host hostname's Shutdown Facility invoked via script has not finished in the last time seconds An operator intervention is required
The Shutdown Facility that is killing host hostname has not terminated yet. Operator intervention may be required. This message will appear periodically, with the period equal to the node's ScriptTimeout value, until either the script terminates on its own or the script is terminated by the Unix kill command. If terminated by the kill command, the host being killed will not be considered killed.
Action: Wait until the script terminates, or terminate the script using the kill command if the script cannot terminate on its own.
• WLT 5 CONTROLLER FAULT Controller <object> has propagated <request> request to its controlled application(s) <applications> but the request has not been completed within the period of <timeout> seconds
When a controller propagates its requests to the controlled applications, it waits for the completion of the request for a period of time sufficient for the controlled applications to process the request. When the request is not completed within this period, the controller faults.
Action: Fix the controller's scripts and/or scripts of the controlled applications, or repair resources of the controlled applica
357. the resource offline and CheckScript to check the state of a resource.
Controller: Configures applications that control other applications.
Fsystem: Configures local or remote file systems.
Gds: Configures disk classes administrated by Global Disk Services (GDS).
Gls: Configures the IP addresses administrated by Global Link Services (GLS).
Ipaddress: Configures the IP addresses that are needed for communication over a LAN interface.
Rcvm: Configures disk groups administrated by the PRIMECLUSTER Volume Manager (not available in all areas).
Vxvm: Configures disk groups administrated by the VERITAS volume manager (not available in all areas).
3.2 Site preparation
The PRIMECLUSTER Installation Guide (Solaris, Linux) describes how to prepare your cluster to operate RMS. Some of the procedures require you to modify system files so that RMS can identify the hosts, file systems, and network interfaces used in a configuration. You should have completed these procedures when RMS was installed. In some cases, you will be creating or modifying your RMS configuration because changes have been made to your site. Certain site changes may require you to review and update your system files first. These changes include, but are not limited to, the following:
• IP addresses were changed
• Redundant interconnects were added to the cluster
• Hosts were a
358. the directory entered by the user begins with a slash character, this character is dropped before creating the /usr/test1/nfs_lock_dir directory.
• Reserve one IP address for each userApplication object from which all the local file systems set with NFS Lock Failover must be shared.
3.2.3 File systems (Linux only)
• /etc/fstab
Contains entries for all of the local file systems that are to be used as resources in the configuration. In other words, this file describes the file systems that need to be mounted locally. For each file system to be managed by RMS, create a line with the standard fstab fields and then insert the string #RMS# at the beginning of the line. For more information, see the fstab manual page.
Example:
#RMS#/dev/sdb2 /fs2 ext2 defaults 1 2
• /etc/exports
Contains entries for all file systems that are available for mounting on other hosts. For each file system to be managed by RMS, create a line with the standard exports fields and then insert the string #RMS# at the beginning of the line. For more information, see the exports manual page.
Example:
#RMS#/usr fuji*(rw)
3.2.4 Log files
• /var/adm/messages (Solaris) or /var/log/messages (Linux)
By default, all RMS messages go to both the system log (messages) and the RMS switchlog file. If you do not want to send messages to the system log, then set HV_SYS_LOG_USE
359. the monitored resources and establishes the interdependencies between them. The default name of this file is config.us.
console
See single console.
custom detector (RMS)
See detector (RMS).
custom type (RMS)
See generic type (RMS).
daemon
A continuous process that performs a specific function repeatedly.
database node (SIS)
Nodes that maintain the configuration, dynamic data, and statistics in a SIS configuration. See also gateway node (SIS), service node (SIS), Scalable Internet Services (SIS).
detector (RMS)
A process that monitors the state of a specific object type and reports a change in the resource state to the base monitor.
directed switchover (RMS)
The RMS procedure by which an administrator switches control of a userApplication over to another node. See also automatic switchover (RMS), failover (RMS, SIS), switchover (RMS), symmetrical switchover (RMS).
distributing a configuration (RMS)
The process of copying a configuration file and all of its associated scripts and detectors to all nodes affected by the configuration. This is normally done automatically when the configuration is activated using PCS, the RMS Wizards, or the CLI. See also activating a configuration (RMS), generating a configuration (RMS).
DOWN (CF)
A node state that indicates that the node is unavailable (marked as down). A LEFTCLUSTER node must be marked as DOWN before it can rejoin a cluster.
360. the pre-online request process is as follows:
1. Request is sent from the parent to the child.
2. Parent node changes to the Wait state, but no script is initiated.
3. Child receives the request.
4. The pre-online script is initiated in the leaf nodes.
5. When the script terminates, confirmation is sent to the parent.
6. As soon as all children of the parent have sent their confirmation, the pre-online script is executed on the parent.
In relation to the resource graph, the above steps illustrate the bottom-up procedure for executing the scripts in online processing. The userApplication node is the final node to execute its pre-online script; it then generates an online request, which is passed to the leaf nodes. However, there is a difference between online processing and pre-online processing. Relative to the resource graph, the online script process is as follows:
1. RMS executes the online script.
2. The system waits until the node detector signals the Online state. If a node does not have a detector, the post-online script executes after the OnlineScript is completed successfully.
3. The post-online script executes immediately.
4. Confirmation of the success of online processing is forwarded to the parent.
5. The node exits the Wait state and changes to the Online state.
As shown previously, leaf nodes in an RMS configuration require at least an OnlineScript
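The bottom-up ordering of the pre-online scripts described above corresponds to a post-order walk of the resource graph, which the following minimal sketch illustrates. The graph representation, the node names, and the function name are assumptions made for the example; the sketch models only the ordering, not the Wait states, detectors, or confirmations.

```python
from typing import Dict, List


def pre_online_script_order(children: Dict[str, List[str]], root: str) -> List[str]:
    """Return nodes in the order their pre-online scripts would run.

    A node's script runs only after all of its children have finished,
    so leaves come first and the userApplication node comes last.
    """
    order: List[str] = []
    seen = set()

    def visit(node: str) -> None:
        if node in seen:  # a resource shared by two parents runs once
            return
        seen.add(node)
        for child in children.get(node, []):
            visit(child)
        order.append(node)  # runs after every child has confirmed

    visit(root)
    return order


if __name__ == "__main__":
    # Hypothetical graph: an application with a file system on a disk, plus an IP address.
    graph = {"app": ["fs", "ip"], "fs": ["disk"], "ip": [], "disk": []}
    print(pre_online_script_order(graph, "app"))
```

In a real cluster, sibling branches can proceed concurrently; the sequential list here only fixes the constraint that a parent never runs before its children.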
361. tion offline
Activating an application
Clearing a fault
Clearing a SysNode Wait state
Displaying environment variables
Displaying application states
Viewing the switchlog
Viewing application logs
Viewing GUI messages
Advanced RMS concepts
Internal organization
Configuration structure
Resource description
Messages
6.1.4 State transition rules
6.2 States and scripts
6.3 Initializing
6.4 Online processing
6.4.1 Online request
6.4.1.1 Manual methods
6.4.1.2 Automatic methods
6.4.2 Online processing in a logical graph of a userApplication
6.4.3 PreCheckScript
6.4.4 Fault situations during online processing
6.4.5 userApplication is already online
6.5 Offline processing
6.5.1 Offline request
362. tions. For user-defined controller scripts, increase their ScriptTimeout values.
8.21 WRP Wrappers
• WRP 1 Failed to set script to TS
The script could not be made into a time-sharing process.
Action: Take action based on the reason.
• WRP 2 Illegal flag for process wrapper creation
Action: Critical error. Contact field support.
• WRP 3 Failed to execv command
This message could occur in any of the following scenarios:
- A detector cannot be started because RMS is unable to create the detector process with the command command.
- hvcm -a has been invoked and the RMS base monitor cannot be started on the individual hosts comprising the cluster with the command command.
- A script cannot be started because RMS is unable to create the script process with the command command.
RMS shuts down on the node where this message appears and returns an error number errno, which is the error number returned by the operating system.
Action: Consult the system manual pages or the appendix of this manual for the explanation of error number errno and see if the cause is evident. If not, contact field support.
• WRP 4 Failed to create a process command
This message could occur in any of the following scenarios:
- A detector cannot be started because RMS is unable to create the detector process to execute the command command.
- hvcm -a
363. to send the kill success message to the cluster host <host>. When a cluster host is killed, the host that requested the kill must send a success message to the surviving hosts. This message appears in the switchlog when sending that message fails. Action: Make sure the cluster and network conditions are such that the message can be sent across the network. • SYS 8: RMS failed to shut down the host <host> via the Shutdown Facility, no further kill functionality is available. The cluster is now hung. This message appears when RMS sent a kill request to the Shutdown Facility and did not receive the elimination acknowledgement. Action: Refer to the Shutdown Facility manuals to find out what went wrong with the host elimination. Check the actual status of the remote host and invoke the appropriate hvutil -u or hvutil -o command to resolve the RMS hang state. • SYS 13: Since this host <hostname> has been online for no more than <time> seconds, and due to the previous error, it will shut down now. This message appears when the checksum of this host differs from that of the other hosts in the cluster (one of the possible reasons). Action: Check the configuration on all cluster hosts and verify that the same configuration is running on all of them. • SYS 14: Neither automatic nor manual switchover will be possible on this host until <detector> det
364. tor log files 173 node names in configuration files 34 35 nodes 11 NoDisplay attribute 338 non basic settings wizards 47 NullDetector attribute 338 O object types andOp 323 controller 323 ENV 323 ENVL 323 gResource 323 orOp 323 SysNode 323 userApplication 323 objects 392 U42117 J Z100 4 76 Index activating applications 139 attributes 26 99 clearing a fault 140 clusterwide table 119 command pop ups 114 122 Controller 12 controller 98 112 dependencies 108 graph customization 115 information 113 relationships 108 RMS full graph 108 RMS tree 100 selecting 100 starting an application 134 switching applications 136 SysNode 11 100 taking application offline 138 types 26 userApplication 11 100 133 offline request 157 offline processing 11 definition 157 fault situations 158 offline scripts 19 Offline state 21 OfflineDoneScript 23 OfflineScript 23 PostOfflineScript 23 offline state 11 OfflineDoneScript attribute 338 script 23 OfflineFault state 21 OfflineScript script 23 OfflineScript attribute 328 online processing 11 online scripts 19 Online state 21 OnlineScript 23 PostOnlineScript 23 PreCheckScript 22 PreOnlineScript 23 online state 11 OnlinePriority attribute 329 OnlineScript attribute 329 script 23 OnlineTimeout attribute 338 operator intervention 165 operator privileges 93 orOp object type 323 P parallel application support 9 PartialCluster attribute 330 PAS comma
365. transitioned from the Online state. • SYS 97: Cannot access the NET_SEND_Q queue. When a new host comes Online, the other hosts in the cluster try to determine whether the new host has been started with the -C option. The host that has just come online uses the queue NET_SEND_Q to send the necessary information to the other hosts in the cluster. If this host is unable to access the queue NET_SEND_Q, this message is printed. Action: Contact field support. • SYS 98: Message send failed in SendJoinOk. When a new host comes Online, the other hosts in the cluster try to determine whether the new host has been started with the -C option. The host that has just come online uses the queue NET_SEND_Q to send the necessary information to the other hosts in the cluster. If this host is unable to send the necessary information to the other hosts in the cluster, this message is printed. Action: Check if there is a problem with the network. 8.18 UAP: userApplication objects • UAP 1: Request to go online will not be granted for application <appname> since the host <sysnode> runs a different RMS configuration. This message appears when the application <appname> is requested to go Online but the host <sysnode> is running a different configuration. Action: Make sure that the user is running the same configuration. • UAP 5: object
366. ts. The number of seconds to wait before reporting a state change after the child application transitions out of the Standby state. If Follow is set to 1, the StandbyTimeout value must be set to 0. The user can modify this attribute for a cmdline subapplication only. The configuration tools control this attribute for all other subapplications. • StandbyTransitions. Possible values: StartUp, SwitchRequest, ClearFaultRequest. Default: empty. Valid for userApplication objects. The value specified determines the standby transitions that are to be executed. StartUp means that at startup the application is requested to go to the Standby state, unless it is already Online or is forced to go Online due to the AutoStartUp attribute. SwitchRequest means that after an application switchover, the application that was Online before the switchover transitions to the Standby state. ClearFaultRequest means that the application is requested to go to the Standby state after a Faulted state has been cleared with hvutil -c. • StateChangeScript. Possible values: valid script (character). Default: empty. Valid for Scalable controller objects. Specifies the script to be executed upon state transitions of either the child applications or the SysNode objects where the child applications can run. The script is executed once each time a child application transitions into one of the states Online, Offline, Faulted, or Standby, even if the
367. ts on the basis of changes to its state, such as a changeover to the Faulted state. Requests always emanate from the userApplication and are forwarded from the parent to the child (top-down). The processing of state change messages between Offline and Online differs as follows: • State change to Offline: Offline processing is top-down; for example, a mount point is unmounted first and then the underlying device is deconfigured. • State change to Online: Online processing is bottom-up. While the online request travels down the tree from the userApplication to the leaf object(s), RMS executes the actual state change bottom-up; for example, RMS first configures the mirror, then mounts the file system on the mirror. 6.1.4 State transition rules RMS uses state transition rules to define which messages (requests or state changes) trigger what reaction in what situation. The fundamental concept is clarified in the following description of RMS procedures. 6.2 States and scripts The chapter "Introduction" on page 9 introduced the concepts of scripts and the functions they perform. Scripts are divided as follows: • Request-triggered scripts: designed to produce a change in a state. • State-triggered scripts: represent a reaction to a specific state. Request-triggered scripts are as follows: • InitScript • PreOnlineScript • PreOfflineScript • PreCh
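The two processing orders described in snippet 367 (offline top-down, online bottom-up) can be illustrated with a short Python sketch. This is not RMS code: the Node class, the function names, and the three-object example tree are assumptions made purely for illustration, based on the mount point and mirror example in the text.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for an object in a userApplication graph."""
    name: str
    children: list = field(default_factory=list)


def offline_order(node):
    """Top-down: a parent is taken offline before its children."""
    order = [node.name]
    for child in node.children:
        order.extend(offline_order(child))
    return order


def online_order(node):
    """Bottom-up: children are brought online before their parent."""
    order = []
    for child in node.children:
        order.extend(online_order(child))
    order.append(node.name)
    return order


# Illustrative tree: application -> mount point -> mirror device.
app = Node("userApplication", [Node("mount_point", [Node("mirror")])])
```

With this tree, offline_order(app) lists the application first and the mirror last, while online_order(app) lists the mirror first and the application last, matching the "unmount, then deconfigure" and "configure the mirror, then mount" examples.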
368. ttribute 336 IgnoreOnlineRequest attribute 337 IgnoreStandbyRequest attribute 337 include directory 29 Inconsistent state 22 IndependentSwitch attribute 337 initial state initializing 151 Unknown state 151 initialization script specifying 348 U42117 J Z100 4 76 391 Index initialization error at 157 InitScript 22 interfaces alternate 35 73 IP addresses defining resources 16 resource wizard 33 Ipaddress resource wizard 33 K killing a node 12 L LAN interfaces 36 LEFTCLUSTER 164 lib directory 29 LieOffline attribute 338 local environment variables 27 log files application 103 base monitor 172 interpreting 185 node detector 173 specify directory 344 switchlog 165 173 time of preservation 344 viewing 146 log levels specifying 182 log messages wizards 187 logging in Cluster Admin 93 M MA commands clrecumonctl 351 clrccusetup 351 clrcimonctl 351 main menu wizards 42 maketrusted 351 making forced online requests 163 management server 92 manual pages display 349 listing 349 market specific applications 10 MaxControllers attribute 328 messages 149 base monitor 172 bmlog 172 debug 171 180 error 171 generic detector log 173 node detector 173 troubleshooting RMS 191 wizards 187 messages error console 295 fatal 281 switchlog 195 MonitorOnly attribute 328 mount_rcfs 350 N naming conventions RMS 35 59 networks site preparation 34 ngadmin 350 node killing 12 node detec
369. uji3 root • /opt/SMAW/SMAWRrms/etc/hvipalias: Contains entries for all of the LAN interfaces that are to be used as resources in the configuration. The entries must provide the names and netmasks that are required for the LAN. Optionally, there may also be some routing information. See the online document Ipaddress.htm or the header of the hvipalias file for the format of the entries. Example (columns: uname -n, IfName, Interface(s), Netmask, Routes): fuji2 045dial eth1 0xffffff00 • /opt/SMAW/SMAWRrms/etc/hvconsoles: Controls customized handling of fault messages. Each entry specifies a program to be executed when an RMS resource object encounters a fault. If the file does not exist, you will receive no fault information. A complete description of the format is available in the hvconsoles online manual or in the comments in the hvconsoles file. Example: ANY fuji2 echo GENERAL_ALERT_ARG 3.2.2 File systems Solaris only • /etc/vfstab: Contains entries for all of the local file systems that are to be used as resources in the configuration. In other words, this file describes the file systems that should be mounted locally. RMS entries appear as comments and are ignored by all processes other than PRIMECLUSTER components. For more information, see the vfstab manual page. Example: #RMS#/dev/dsk/c0t0d0s0 /dev/rdsk/c0t0d0s0 /testfs1 ufs 1 yes etc dfs
370. ult none. Valid for gResource objects. Specifies a string to be forwarded to the generic detector. • SplitRequest. Possible values: 0, 1. Default: 0. Valid for controller objects. If set to 1, PreOffline and PreOnline requests are propagated to child applications separately from the Offline and Online requests. If set to 0, separate PreOffline or PreOnline requests are not issued for the child applications. Also, if set to 0, only Offline and Online requests are propagated, provided that IgnoreOfflineRequest and IgnoreOnlineRequest, respectively, are set to 0. 14 Appendix: Environment variables You can change the RMS environment by modifying the appropriate entries in the hvenv.local file and restarting RMS. Caution: RMS environment variables cannot be set in the user environment explicitly. Doing so can cause RMS to lose environment variable settings. Refer to the Reliant Monitor Services (RMS) Configuration and Administration Guide for more information about environment variables. 14.1 Global environment variables Global variable settings (ENV) are included in the configuration's checksum that is common to the cluster. The checksum is verified on each node during startup of the base monitor. RMS will fail to start if it detects a checksum difference between the values on any two nodes. Th
371. ult caused, provided that offline processing is successful, even if the AutoSwitchOver attribute is not set. Switchover had evidently been requested at this time by the system administrator who sent the directed switch request (online). The target host of the switchover procedure may not be the host with the highest priority; it is the host explicitly specified in the directed switch request. 6.6.2 Offline faults Even if a userApplication is not online on a host, RMS still monitors the nodes configured in the graph of the userApplication. If a detector indicates a fault in such a node, the fault is displayed. However, no processing takes place: the fault script is not executed and no message is sent to the parent. In this case, an AND node could be offline although one of its children is Faulted. RMS is designed this way because the mandatory dependencies between the nodes in a userApplication graph exist only if the userApplication is to run. In the offline case, RMS treats the nodes as individual instances and does not evaluate their mutual interdependencies. However, an exception occurs when the ClusterExclusive attribute is set. If a userApplication is offline, has the ClusterExclusive attribute set, and has children that are not offline, then hvdisp will display the state Inconsistent instead of Offline for this userAppl
372. ult timeout for the hvshut command expired and some of the hosts are still running. Action: Adjust the default timer by setting RELIANT_SHUT_MIN_WAIT to a value that is large enough to allow a shutdown on all hosts. Check whether the shutdown fails because of internal problems (e.g. a failing OfflineScript causes a userApplication to fail to go Offline). • SYS 90: <hostname> internal WaitList addition failure. Cannot set timer for delayed detector report action. System error. Action: Contact field support. • SYS 93: The cluster host <nodename> is not in the Wait state. The hvutil command request failed. This message appears when the user issues the hvutil command (hvutil -o or hvutil -u) and the cluster host <nodename> is not in the Wait state. Action: Reissue hvutil -o or hvutil -u only when the host is in a Wait state. • SYS 94: The last detector report for the cluster host <hostname> is not online. The hvutil command request failed. This message appears when the user issues the hvutil command (hvutil -o <sysnode>) to clear the Wait state of the SysNode and the SysNode is still in the Wait state because the last detector report for the cluster host <hostname> is not Online, i.e. the SysNode might have transitioned to the Wait state not from Online but from some other state. Action: Issue hvutil -o only when the host is in a Wait state that has
373. un once for each user application or node. • RELIANT_STARTUP_PATH. Possible values: any valid path. Default: <RELIANT_PATH>/build. Defines where RMS searches at start time for the configuration files. • SCRIPTS_TIME_OUT. Possible values: 0 - MAXINT. Default: 300 (seconds). Specifies the global period, in seconds, within which all RMS scripts must terminate. If a specific script does not terminate within the defined period, it is assumed to have failed and RMS begins appropriate processing for a script failure. If this value is too low, error conditions will be produced unnecessarily and it may not be possible for the applications to go online or offline. An excessively high value is unsuitable because RMS will wait for this period to expire before assuming that the script has failed. If the global setting is not appropriate for all objects monitored by RMS, it can be overridden by an object-specific setting of the ScriptTimeout attribute. 15 Appendix: List of manual pages This appendix lists the online manual pages for CCBR, CF, CFS, CIP, Monitoring Agent, PAS, PCS, RCVM, Resource Database, RMS, RMS Wizards, SCON, SF, SIS, and Web-Based Admin View. To display a manual page, type the following command: man man_page_name 15.1 CCBR System administration cfbackup: save the cluster configuration information for a PRIMECLUSTER node. cfrestore: restore saved clus
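The relationship between the global SCRIPTS_TIME_OUT value and an object-specific ScriptTimeout attribute, described in snippet 373, can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the RMS implementation: the function names, the dictionary standing in for the environment, and the use of subprocess to model "a script that does not terminate in time has failed" are all hypothetical.

```python
import subprocess
import sys

GLOBAL_DEFAULT = 300  # documented default of SCRIPTS_TIME_OUT, in seconds


def effective_timeout(env, script_timeout=None):
    """Return the timeout that applies to one object's scripts.

    An object-specific ScriptTimeout (if given) overrides the global
    SCRIPTS_TIME_OUT; otherwise the global value or its default is used.
    """
    if script_timeout is not None:
        return int(script_timeout)
    return int(env.get("SCRIPTS_TIME_OUT", GLOBAL_DEFAULT))


def run_script(argv, timeout):
    """Run a script and report success.

    A script that exceeds the timeout is treated as failed, mirroring the
    behaviour the text describes (illustrative model only).
    """
    try:
        return subprocess.run(argv, timeout=timeout).returncode == 0
    except subprocess.TimeoutExpired:
        return False
```

For example, effective_timeout({"SCRIPTS_TIME_OUT": "120"}, 1300) yields the object-specific value, while effective_timeout({}) falls back to the documented default of 300 seconds.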
374. unning a different configuration than the local host, or different loads of the RMS package are installed on these hosts. Action: Make sure all the hosts are running the same configuration and the configuration is distributed to all hosts. Make sure that the same RMS package is installed on all hosts (same load). • SYS 49: Since this host <hostname> has been online for more than <time> seconds, and due to the previous error, it will remain online, but neither automatic nor manual switchover will be possible on this host until <detector> detector will report offline or faulted. This message appears when the checksum of this host differs from that of the other hosts in the cluster (one of the possible reasons). Action: Check the configuration on all cluster hosts and verify that the same configuration is running on all of them. • SYS 50: Since this host <hostname> has been online for no more than <time> seconds, and due to the previous error, it will shut down now. This message appears when the checksum of this host differs from that of the other hosts in the cluster (one of the possible reasons). Action: Check the configuration on all cluster hosts and verify that the same configuration is running on all of them. • SYS 84: Request <hvshut -a> timed out. RMS will now terminate. Note: some cluster hosts may still be online. This message appears when the defa
375. uration Consistency Report 10 Configuration ScriptExecution 11 Configuration Push 12 RMS ViewMachine Choose an action: Figure 10: Main configuration menu when RMS is running. Configuration Push provides the capability to update (push) the running configuration to another node that needs updating. For example, if one cluster node were down for maintenance and you updated the RMS cluster configuration in the meantime, you could use Configuration Push to update the node that was down for maintenance. Item 12, RMS ViewMachine, replaces the menu items that allow changes to the configuration when RMS is inactive. 3.4.3 Secondary menus Each of the main menu items has a number of secondary menus. The secondary menus themselves can have sub-menus. The Creation: Application type selection menu (Figure 11) is an example of a secondary menu. You see this menu after selecting Application Create from the main menu. Creation: Application type selection menu: 1 HELP 2 QUIT 3 RETURN 4 OPTIONS 5 DEMO 6 GENERIC 7 LIVECACHE 8 R3ANY 9 R3CI 10 RTP Application Type: 5 Figure 11: Application type selection. This option allows you to select an application type to be assigned to the application in question. This is an important step in the configuration procedure since it invokes the
376. [Figure 7: Relationship between RMS and RMS Wizards] 2.4.1 RMS Wizard Tools The RMS Wizard Tools package provides the following for basic resource types, such as file systems and IP addresses: • Online scripts • Offline scripts • Detectors. In addition to the basic resource support, the RMS Wizard Tools package contains the hvw command, which is the entry point to the user configuration interface. The hvw interface provides a simple menu-driven interface that allows a user to enter information specific to applications placed under the control of RMS. hvw also provides an interface through which application-specific knowledge can be dynamically added to provide turnkey solutions for those applications typically found in the data center. These application-specific modules are provided by the RMS Wizard Kit. 2.4.2 RMS Wizard Kit The RMS Wizard Kit provides application knowledge modules which can be used by the hvw command. The knowledge modules provide hvw with information specific to popular applications, which greatly eases the configuration task. The following are also provided for specific applications: • Online scripts • Offline scripts • De
377. us purposes (for example, SNMP agent). This value is supplied by the configuration wizards. • Comment. Possible values: any string. Default: empty. Valid for all objects. Used for documentation in the configuration file; no functional meaning within RMS. • ControlledShutdown. Possible values: 0, 1. Default: 0. Valid for controlled userApplication objects. If set to 1, RMS does not send an Offline request to this application because an explicit request will be generated by a parent application during its offline processing. • ControlledSwitch. Possible values: 0, 1. Default: 0. Valid for controlled userApplication objects. If set to 1, the application is the child of a Follow controller. • DetectorStartScript. Possible values: any valid detector start script. Default: empty. Valid for resource objects with a detector. Specify the detector start command directly in the <configname>.us file. • HostName. Possible values: any SysNode name. Default: empty. Must be set only in the first-level andOp children of a userApplication object. Each of these andOp objects associates its parent application with the SysNode specified in its HostName attribute; the child andOp objects also determine the priority of the application's nodes. • IgnoreOfflineRequest. Possible values: 0, 1. Default: 1. Valid for contro
378. users do not have access to the Administrative LAN, it provides an extra level of security. The use of an Administrative LAN is optional. See also public LAN. Glossary: API: See Application Program Interface. application (RMS): A resource categorized as a userApplication, used to group resources into a logical collection. Application Program Interface: A shared boundary between a service provider and the application that uses that service. application template (RMS): A predefined group of object definition value choices used by the RMS Wizard Kit to create object definitions for a specific type of application. attribute (RMS): The part of an object definition that specifies how the base monitor acts and reacts for a particular object type during normal operations. automatic switchover (RMS): The procedure by which RMS automatically switches control of a userApplication over to another node after specified conditions are detected. See also directed switchover (RMS), failover (RMS, SIS), switchover (RMS), symmetrical switchover (RMS). availability: Availability describes the need of most enterprises to operate applications via the Internet 24 hours a day, 7 days a week. The relationship of the actual to the planned usage time determines the availability of a system. base cluster foundation (CF): This PRIMECLUSTER module resides on top of the basic OS and provides internal interfaces for the CF (Clu
379. ut to the system log from the RMS base monitor. RMS always records RMS ERROR, FATAL ERROR, WARNING, and NOTICE messages in the RMS switchlog file. By default, these messages are duplicated in the system log file /var/adm/messages (Solaris) or /var/log/messages (Linux). To disable RMS messages in the system log, set HV_SYSLOG_USE=0 in hvenv.local. • RELIANT_HOSTNAME. Possible values: valid name. Default: <nodename>RMS. The name of the local node in the RMS cluster. The default value of this variable is the node name with an RMS suffix (for example, fuji2RMS), as generated by the following command: export RELIANT_HOSTNAME=`cftool -l 2>/dev/null | tail -1 | cut -f1 -d" "`RMS If this preset value is not suitable, it must be modified accordingly on all nodes in the cluster. The specified cluster node name must correspond to the SysNode name in the <configname>.us configuration file. The node name determines the IP address that RMS uses for establishing contact with this node. • RELIANT_INITSCRIPT. Possible values: any executable. Default: <RELIANT_PATH>/bin/InitScript. Specifies an initialization script to be run by RMS when the system is started. This variable is not set by default. This script is run before any other processes are activated. It is a global script that is run once on every cluster node on which it is defined and is not r
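The RELIANT_HOSTNAME default in snippet 379 is built by taking the last line of the cftool output, keeping its first space-delimited field, and appending the RMS suffix. The Python sketch below mirrors only that text processing. The function name and the sample cftool output are assumptions for illustration; the real output format of cftool is not shown in this document.

```python
def default_rms_hostname(cftool_output):
    """Mirror `tail -1 | cut -f1 -d" "` and append the RMS suffix.

    cftool_output is the captured text of the cftool call (hypothetical
    sample data in the example below).
    """
    lines = cftool_output.splitlines()
    last_line = lines[-1] if lines else ""     # tail -1
    node_name = last_line.split(" ")[0]        # cut -f1 -d" "
    return node_name + "RMS"


# Hypothetical output; only the last line's first field matters here.
sample = "Node Number State Os\nfuji2 1 UP Linux\n"
print(default_rms_hostname(sample))
```

Because the suffix is appended unconditionally, the result for a node named fuji2 is the SysNode-style name fuji2RMS used throughout the examples in this manual.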
380. variables. The default values of the environment variables are found in <RELIANT_PATH>/bin/hvenv. They can be redefined in the hvenv.local command file. The following list describes the local environment variables for RMS: • HV_CONNECT_TIMEOUT. Possible values: 0 - MAXINT. Default: 0 (seconds). Users do not normally need to change the default setting. The maximum time, in seconds, that the node detector hvdet_node uses for connections to all remote cluster nodes before assuming that the connection attempt has failed. • HV_LOG_ACTION. Possible values: on, off. Default: off. Determines whether the current log files in the directory RELIANT_LOG_PATH are deleted if the used space on the file system is greater than or equal to HV_LOG_ACTION_THRESHOLD. • HV_MAX_HVDISP_FILE_SIZE. Possible values: 0 - MAXINT. Default: 20,000,000 (bytes). Prevents the unlimited growth of the temporary file that RMS uses to supply hvdisp with configuration data and subsequent configuration and state changes. The value of this variable is the maximum size, in bytes, of the temporary file <RELIANT_PATH>/locks/.rms.<process id of the hvdisp process>. • HV_MAXPROC. Possible values: 0 - fork limit. Default: 30. Defines the maximum number of scripts RMS can have forked at any time. The default (30) is sufficient in most cases. • HV_RCSTART. Possible values: 0
381. virtual disk. single console: The workstation that acts as the single point of administration for nodes being monitored by RMS. The single console software, SCON, is run from the single console. SIS: See Scalable Internet Services (SIS). state: See resource state (RMS). Storage Area Network: The high-speed network that connects multiple external storage units and storage units with multiple computers. The connections are generally fiber channels. striped virtual disk (RCVM): Striped virtual disks consist of two or more pieces. These can be physical partitions or further virtual disks (typically a mirror disk). Sequential I/O operations on the virtual disk can be converted to I/O operations on two or more physical disks. This corresponds to RAID Level 0 (RAID0). See also concatenated virtual disk (RCVM), mirror virtual disk (RCVM), simple virtual disk (RCVM), virtual disk. switchover (RMS): The process by which RMS switches control of a userApplication over from one monitored node to another. See also automatic switchover (RMS), directed switchover (RMS), failover (RMS, SIS), symmetrical switchover (RMS). symmetrical switchover (RMS): This means that every RMS node is able to take on resources from any other RMS node. See also automatic switchover (RMS), directed switchover (RMS), failover (RMS, SIS), switchover (RMS). system graph (RMS): A visual representation (a map) of monitored re
382. y and is therefore not possible. Please specify the entire commandline or use hvcm without further options to run the default configuration. This message appears when the user tries to start RMS without the -c option while specifying other command-line options. Action: When using hvcm with the -c option, -c <configname> should be the last arguments on the command line. Alternatively, to use the default configuration, enter hvcm without any arguments to start RMS on the local node, and hvcm -a to start RMS on all nodes. • RMS has failed to start: didn't find a valid entry in the RMS default configuration file <configfilename>. This message appears when the RMS default configuration file exists but does not contain a valid reference to a configuration to run. Action: Either place a default configuration file name in the RMS default configuration file, or put in it the name of the current configuration that the user wants to start. • RMS has failed to start: invalid entry in the RMS default configuration file <configfilename>. The user is not allowed to start RMS if the RMS default configuration file contains an invalid entry. The possible valid entries are 1. <configname> or 2. hvcm <options> -C <configname>. Refer to the hvcm man page for details on valid options in format 2. Action: Rem
383. y request. • DET 26: FAULT REASON: Resource <resource> transitioned to a Faulted state due to the resource failing to come Online. This message appears when the resource fails to come Online after executing its Online scripts, which may transition the state of the resource to Faulted. Action: Check what prevented the resource <resource> from coming Online. • DET 28: <object>: CalculateState was invoked for a non-local object. This must never happen. Check for possible configuration errors. During the processing of a request within the state engine, a request or response token was delivered to an object that is not defined for the local host. Critical internal error. Action: Contact field support. • DET 33: DETECTOR STARTUP FAILED. Restart count exceeded. When a detector dies, RMS attempts to restart it. If a detector successfully restarts and then dies again too many times within one minute, RMS assumes there is a problem, terminates the restart cycle, and prints this message. Action: Contact field support. • DET 34: No heartbeat has been received from the detector with pid <pid>, <startupcommand>, during the last <seconds> seconds. The base monitor will send the process a SIGALRM to interrupt the detector if it is currently stalled waiting for the alarm. In order to avoid stalling of RMS detectors
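The restart cycle behind DET 33 in snippet 383 (restart a dead detector, but give up when it dies too often within one minute) can be modelled as a sliding-window counter. The Python sketch below is an illustration only: the class name and the max_restarts value are hypothetical, since the document does not state the actual restart limit; only the one-minute window comes from the text.

```python
from collections import deque


class RestartLimiter:
    """Sliding-window limit on detector restarts (illustrative model)."""

    def __init__(self, max_restarts=3, window_seconds=60.0):
        # max_restarts is an assumed value; the manual gives no number.
        self.max_restarts = max_restarts
        self.window_seconds = window_seconds
        self._deaths = deque()

    def record_death(self, now):
        """Record a detector death at time `now` (in seconds).

        Returns True if a restart is still allowed, False if the restart
        cycle should be terminated.
        """
        self._deaths.append(now)
        # Forget deaths that fall outside the one-minute window.
        while now - self._deaths[0] > self.window_seconds:
            self._deaths.popleft()
        return len(self._deaths) <= self.max_restarts
```

Passing the timestamp in explicitly keeps the sketch deterministic; a caller could supply time.monotonic() in practice.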
384. y script independent of RMS. By selecting the resources configured for the application, the user can execute the scripts that bring the resources Online or Offline. To see the online scripts being executed, you can go through the resource list, which is displayed for this purpose, in ascending order. The return code indicates whether the respective script functions properly. • RMS CreateMachine: Defines the list of machines which constitute the cluster. During the activation phase, the RMS configuration is distributed to all the nodes in this list. Applications managed by RMS must each be configured to run on one or more machines in this pool. Therefore, complete this step before creating any application. • RMS RemoveMachine: Removes machines from the list of cluster nodes. 3.4.2.2 Main configuration menu when RMS is running When RMS is running on the local node, the Main configuration menu changes beginning with item 11, where the Configuration Push menu item replaces the Configuration Activate menu item (see Figure 10). fuji2: Main configuration menu, current configuration: mydemo. RMS up on: fuji2RMS. RMS down on: fuji3RMS. 1 HELP 2 QUIT 3 Application View 4 Configuration Generate 5 Configuration Copy 6 Configuration Remove 7 Configuration Freeze 8 Configuration Edit Global Settings 9 Config
385. ype Controller (not yet consistent) 1 HELP 2 - 3 SAVE EXIT 4 REMOVE EXIT 5 ControlPolicy FOLLOW 6 AdditionalAppToControl 7 InParallel 8 FaultScript Choose the setting to process: 6 Figure 53: Assigning a controller 4.12 Specifying controlled applications Once you specify a controller, the wizard needs to know which application to control. Select AdditionalAppToControl by entering the number 6. The menu that appears offers you a list from which to choose an application (Figure 54). 1 HELP 2 RETURN 3 FREECHOICE 4 APP1 Choose an application to control: 4 Figure 54: List of applications to be chosen as controlled applications. The controlled application is APP1, while APP2 is the controlling application. Choose the application to be controlled as follows: > Select APP1 by entering the number 4. The controller flags menu appears (Figure 55). Set flags for sub application: APP1. Currently set: AUTORECOVER, TIMEOUT (AT180) 1 HELP 2 - 3 SAVE RETURN 4 DEFAULT 5 MONITORONLY (M) 6 NOT AUTORECOVER (A) 7 TIMEOUT (T) Choose one of the flags: Figure 55: Menu for setting controller flags. There are a number of flags that can be set for a controlled application. In this example, the A (AUTORECOVER) flag has been set. The A flag means: If the controlled appli
386. ysical system resources. • Manual: Provides current manual pages for commands that are frequently used to configure an application with the RMS Wizards. The hvw and hvexec commands, which were also described in this chapter, are explained here in more detail. Manual pages: Information on the commands that are used for configuration with the RMS Wizards may also be obtained by calling up the manual pages. Manual pages are available, for instance, for the hvw and hvexec commands, which were also described in this chapter. 4 Configuration example This chapter provides an example of the configuration process using the RMS Wizards. Two simple applications are configured for operation on a small cluster. The example includes the following steps: • "Creating a configuration" on page 58 • "Adding hosts to the cluster" on page 58 • "Creating an application" on page 61 • "Entering Machines+Basics settings" on page 64 • "Entering non-basic settings" on page 68 • "Specifying a display" on page 70 • "Adding AlternateIps to the cluster (Linux only)" on page 73 • "Activating the configuration" on page 77 • "Creating a second application" on page 79 • "Setting up a controlling application" on page 83 • "Specifying controlled applications" on page 84 • Activating the conf
387. zardsOnly
  5 AdditionalAlternateIps
  6 AdditionallI_List
  7 IpAliases[0] fuji2RMS fuji2rmsAI01 fuji2rmsAI02
  8 MaxAlternateIps
  9 PreCheckTimeout
  10 FirstAvailableDetector 0
  11 LastAvailableDetector 127
  12 MaxMenuItemsDisplayed
  13 DetectorDetails
  Choose the global setting to process:
Figure 41: Global settings main menu with AlternateIps for first host

Item 7, IpAliases[0], now displays fuji2RMS and the names that correspond to its alternate interfaces. Note that the menu header now indicates that the configuration is not yet consistent, along with the reason for the status change: fuji3RMS has AlternateIps that have not yet been added to the cluster.

Repeat the above process for fuji3RMS, this time adding fuji3rmsAI01 and fuji3rmsAI02 to the cluster. The final Global settings main menu should appear as shown in Figure 39.

[Figure: Global settings main menu (consistent). The 14-item menu listing is garbled in extraction. Recoverable entries: HELP, SAVE EXIT, ShowTurnkeyWizardsOnly, AdditionalAlternateIps, IpAliases entries for fuji2RMS and fuji3RMS, MaxAlternateIps, PreCheckTimeout, FirstAvailableDetector 0, LastAvailableDetector 127, MaxMenuItemsDisplayed, DetectorDetails. Prompt: Choose the global setting to process.]
388. zing 151
  object selection 100
  object type 323
  switching application to 23, 25
  Wait state, clearing 140
Sysnode state change script 23
system files and site preparation 34

T
tables
  clusterwide 119
  command pop-ups 122
taking an application offline 138
truss (Solaris trace tool) 194
turning off wizard debug output 189
turnkey wizards 32, 43, 55
  DEMO 41, 61
  GENERIC 80
  ORACLE 33
  R/3 33
turnkey wizards, See also wizards 61

U
Unknown state 22
  exiting 151
  initial state 151
us directory 29
userApplication 54
  activating 138
  clearing fault 140
  hvswitch command 135
  object 11
  object selection 100
  object type 323
  RMS tree 97
  state change script 23
  state information 119
  taking Offline 138
  with hvshut 133
userApplication node
  initializing 151

V
vdisk 352
viewing
  application logs 146
  GUI message 146
  switchlogs 145
volume managers 1, 16
Vxvm resource wizard 33

W
Wait state 22, 165
  clearing faulted resources 25
  clearing hung nodes 25
  clearing SysNode 140
Warning state 21
WarningScript 23
WarningScript attribute 334
Web-Based Admin View
  login 93
  primary management server 92
  secondary management server 92
Wizard Kit 17, 19
  configuration 32
  overview 10
Wizard Tools 32
wizards
  basic settings 47
  configuring 16, 17, 32
  debug level 190
  debug messages 189
  debug reporting 190
  DEBUG statements 190
  DEMO turnkey 41
  frequently used items 41
  gener