Platform RTM User Guide - Platform Cluster Manager

Contents

1. Manual database backup and restore procedures. For RTM 1.04 and later versions, all components are backed up and packed into a .tgz file using the RTM Console. However, previous versions of RTM require that you back up your database manually. Back up the Web site: This is always a good step whenever upgrading Cacti. To do this, back up all directories except the log and rra directories. If you are not concerned about the size of the backup, you can back up those directories as well. Below is a command example: cd /var/www/html; tar --exclude rrd --exclude log -cf cacti_backup.tar cacti. If your RTM upgrade does not involve the Cacti product, you can greatly speed up and simplify your backup using the following command: cd /var/www/html/plugins; tar --exclude rrd -cf grid_backup.tar grid. Back up the MySQL database: Platform RTM provides a utility for backing up your RTM database. When used, it backs up your database schema and some very important Cacti and RTM tables to disk. It does not back up some rather large tables in RTM, simply due to their size and purpose. The Cacti/RTM tables not backed up include: grid_jobs, grid_jobs_finished, grid_jobs_reqhosts, grid_jobs_rusage, grid_job_interval_stats, grid_job_memperf, grid_arrays, grid_job_daily_stats, poller_output, poller_output_boost. Note: SSH private key pair files for grid cluster control authentication are not backed up by the backup process.
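The Web site backup described above can be sketched as a small shell function. This is a minimal sketch, not the RTM procedure itself: the function name backup_site and its three parameters are hypothetical, and it assumes a tar that supports --exclude (GNU tar and bsdtar both do). The excluded directory names follow the command example in the excerpt.

```shell
#!/bin/sh
# Hypothetical helper: archive one site directory while skipping bulky data.
# Usage: backup_site WEBROOT SITE_DIR ARCHIVE
backup_site() {
    webroot=$1
    site=$2
    archive=$3
    # -C switches into the web root first, so archive paths start at SITE_DIR.
    # The excluded names (rrd, log) are taken from the guide's example command.
    tar --exclude=rrd --exclude=log -cf "$archive" -C "$webroot" "$site"
}
```

For instance, backup_site /var/www/html cacti /tmp/cacti_backup.tar would mirror the first command in the excerpt. Listing the result with tar -tf before starting an upgrade is a cheap way to confirm what was captured.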
2. You can configure the following on this page: on-demand RRD update settings for Boost; Boost server settings; image caching settings for Boost. Mail/DNS tab: Click the Mail/DNS tab to open the Cacti Settings Mail/DNS page and configure email and DNS settings for the Cacti server. You can configure the following on this page: emailing options; location of the sendmail binary file on the RTM host, if sendmail is selected as the mail service (if the file is found and verified, the message OK FILE FOUND appears below this field); SMTP server options; DNS options. Configuring Date, Time, and License Information. Date & Time tab: Navigate to the Datetime Edit page by clicking the Admin tab, then the Date & Time tab. This page defines the time zone and the current date and time. You can also specify an NTP server. Note: If the server is not able to synchronize the date and time, you can set them manually. NTP overrides any manual settings once the server is able to synchronize. This page contains the following fields. Timezone Setting: Set this to your local time zone. Date Setting: Select the current date. Time Setting: Enter the current time. NTP Server: Specify a preferred NTP server. After you change any of these settings, RTM restarts the system services. License tab: Navigate to the License
3. Type: The specified purpose of the license server. Location: The specified physical location of this license server. Collect Status: The current status of the license server. Status can be Down, Recovering, Up, and Unknown. Current Time: The time taken to poll this server. Average Time: The average time taken for polling this server. Max Time: The maximum time taken for polling this server. Availability: The availability of the license server, based on successful polling and reported as a percentage. Choose an action: Select one or more checkboxes for the servers on which to perform an action (for example, Enable, Disable, or Clear Stats), then select an action and click go. Navigate to the Utilities page by clicking Utilities under the Grid Management section of the Console menu bar. This page shows information about RTM utilities related to database administration, such as data backup, purging, and record removal, along with status information about cluster pollers. View Grid Process Status: Click to open the Grid Process Status page and show status information associated with cluster polling processes (for example, statistics for cluster poller runtime, database maintenance, and license collection). Force Cacti Backup: Click to perform a backup of key Cacti and RTM database tables. See the Appendix for more information on database backup and restore. Purge Completion Factor Data: Click to purge the completion f
4. Click the Visual tab to open the Grid Settings Visual page and control a number of visual settings, including the size of the summary icons and the frequency of refresh intervals. This page contains the following fields. Screen Refresh Interval: The interval used by RTM to refresh the page you are currently viewing. Exclusion Filter Status: Determines whether the exclusion filter is on or off by default. If this option is checked, only hosts that match the exclusion filter settings in the field below are shown in the host dashboard. Exclude Host States: Select one or more of the exclusion states to eliminate hosts of that state from appearing in the dashboard. Doing this leaves only the hosts with states you care about in the dashboard. For this to work, you must enable (check) Exclusion Filter Status. To select multiple states, press and hold the <CTRL> key while choosing host states from the list. Icon Size Transition Host Count: When this number is exceeded, RTM uses smaller icons to represent hosts on the Dashboards > Host page. Hosts Per Summary Row (large icons): Determines the number of hosts to show on each line of the Dashboards > Host page, provided the configured Icon Size Transition Host Count is not exceeded. The icons presented in this state are larger and include the host name. This setting is dynamic. Therefore, if you filter to reduce the number of hosts di
5. Click the Thold tab to open the Grid Settings THold page and configure cluster threshold settings, including thresholds to identify when resources are idle, closed, low, busy, or starved. You can configure the following on this page: when to consider a host idle or closed, based on job slots; when to consider a host idle with jobs, based on CPU percentage; when to consider a host low on resources, based on load average; when to consider a host's low physical memory urgent; when to consider a host's low swap memory urgent; when to consider a host's low temp memory urgent; the point at which an I/O rate is high enough to consider a host low on resources (KB/sec); the point at which a paging rate is high enough to consider a host low on resources (pages/sec); when to consider a host busy, based on CPU; when to consider a host busy, based on the load average; the point at which an I/O rate is high enough to consider a host busy (KB/sec); the point at which a paging rate is high enough to consider a host busy (pages/sec); when to consider a host idle, based on a comparison of load vs. running jobs (shows if a host has orphaned or non-CPU-intensive jobs running); when to consider a host idle, based on a comparison of load vs. running jobs (shows if a host may be
6. Platform RTM User Guide. Platform RTM Version 2.0. Release date: March 2009. Sections: Copyright; We'd like to hear from you; Document redistribution and translation; Internal redistribution; Trademarks; Third-party license agreements. Copyright: 1994-2009 Platform Computing Inc. Although the information in this document has been carefully reviewed, Platform Computing Corporation ("Platform") does not warrant it to be free of errors or omissions. Platform reserves the right to make corrections, updates, revisions, or changes to the information in this document. UNLESS OTHERWISE EXPRESSLY STATED BY PLATFORM, THE PROGRAM DESCRIBED IN THIS DOCUMENT IS PROVIDED "AS IS" AND WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. IN NO EVENT WILL PLATFORM COMPUTING BE LIABLE TO ANYONE FOR SPECIAL, COLLATERAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING WITHOUT LIMITATION ANY LOST PROFITS, DATA, OR SAVINGS, ARISING OUT OF THE USE OF OR INABILITY TO USE THIS PROGRAM. We'd like to hear from you: You can help us make this document better by telling us what you think of the content, organization, and usefulness of the information. If you find an error or just want to make a suggestion for improving this document, please address your comments to doc@platform.com. Your comments should pertain only to Platform documentation. For product support, contact
7. Restore the database from the archives. Database optimization schedule: MySQL performs database optimization according to this schedule. Archiving tab: Click the Archiving tab to open the Grid Settings Archiving page and configure database archiving settings. Data archiving allows deep-dive analysis that does not impact the system, because you run the analysis on the archive database instead of on a database that is currently in use. You can configure the following on this page: enable data archiving; frequency of data archiving; database type that will store data archives; name of the host receiving data archives; name of the database receiving data archives; database account user name, password, and port for connecting with the database; enable RRD file creation for archiving during record purging (Note: Enabling this will result in very large data archives); storage location of archived RRD files. Paths tab: Click the Paths tab to open the Grid Settings Paths page and configure cluster directories and file paths. You can configure the following on this page: location of log files on poller hosts, for example /var/www/html/cacti/log/cacti.log (if the file is found and verified, the message OK FILE FOUND appears below this field); location of job rusage RRD and image files, for example /opt/cacti/gridcache (if the directory is found and verified, the message OK DIR FOUND appears below this field). Thold tab
8. Load/Batch: The current Load and Batch status for the host group. If no Status filter is currently set, this field shows N/A. Otherwise, it shows the value currently selected for the Status filter. Total Hosts: The total number of hosts in this host group. AVG CPU: The average CPU utilization for hosts in this host group. AVG r1m: The average exponentially averaged effective CPU run queue length for this host group over the last minute. Avg Effic: The average efficiency of the host group. Total CPU: The overall CPU utilization rate of the host group. Max Memory: The maximum memory consumed by the host group. Max Swap: The maximum swap usage of the host group. Max Slots: The maximum number of job slots available for this host group. Num Slots: The number of job slots used by jobs dispatched to this host group. Run Slots: The number of job slots used by jobs running on this host group. SSUSP Slots: The number of job slots used by system-suspended jobs on the host group. USUSP Slots: The number of job slots used by user-suspended jobs on the host group. Reserve Slots: The number of job slots used by pending jobs that have job slots reserved within the host group. By Queue page: Navigate to the By Queue page by clicking By Queue under the Job Info section of the Grid menu bar. This display is very similar to the LSF bqueues command, with these exceptions: It includes the a
9. The size of each job record depends on job volume and your cluster settings. The system can hold a maximum of 10 million records. Use this upper limit, along with the approximate number of jobs per week in your cluster, to determine the ideal retention period. Retention period for individual job records: Individual job records are kept for this period of time after the job ended. The size of each job record depends on job volume and your cluster settings. Retention period for daily summary statistics: Records of daily summary statistics are kept for this period of time after the job ended. Because these records are added every day, you can keep them for a longer period of time, depending on the job volume. Smaller clusters with fewer than one million jobs per year can have a retention period as high as three years. Maximum number of database records to remove. Maximum down time for daemons disabled for maintenance purposes. Enable database backup: This enables a disaster recovery backup to restore your Cacti and RTM configuration. Some job data is lost during the database restoration, though you can use other utilities to restore all the job data. Note: Database backup files are disk intensive for larger clusters. Database backup schedule. Number of database backups to maintain. Database backup file location. Restore the database from the archives. Database optimization schedule: MySQL performs database optimization according to this schedule. A
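The retention guidance above amounts to a simple division, sketched here in shell. The function name max_retention_weeks is hypothetical, and the sketch assumes roughly one database record per job, which is a simplification: the excerpt only gives the 10 million record ceiling and says to weigh it against weekly job volume.

```shell
#!/bin/sh
# Hypothetical helper: upper bound on the job-record retention period, in weeks.
# Usage: max_retention_weeks JOBS_PER_WEEK [MAX_RECORDS]
max_retention_weeks() {
    jobs_per_week=$1
    # 10 million is the record ceiling quoted in the guide.
    max_records=${2:-10000000}
    # Integer division rounds down, which errs on the safe side.
    echo $(( max_records / jobs_per_week ))
}

max_retention_weeks 250000
```

A cluster finishing about 250,000 jobs per week would reach the ceiling after 40 weeks under this assumption, so a retention period comfortably below that leaves headroom for bursts.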
10. job submission details, the job execution environment, current/last job status, and a graphical job history. Job Name: The name of the job. Status: The current status of the job. State Changes: The number of times that the status of the job has changed. User ID: The LSF user who submitted the job. Mem Usage: Total resident memory usage of all processes in the job, in MB. VM Size: Total virtual memory usage of all processes in the job, in MB. CPU Usage: CPU utilization for this job. CPU Effic: The efficiency with which this job is using the CPU allocated to it, expressed as a percentage. Start Time: The time at which the job was started. Pend: The length of time for which the job has been in the pending state. Run: The length of time for which the job has been running. SSusp: The length of time for which the job has been suspended by the system. Thresholds: At the bottom of the Details page, there are color codes that indicate job efficiency thresholds, including Warning, Alarm, Flapping, and Dependencies. You can set the colors for each of these thresholds, along with the thresholds themselves, from the Console tab on the Grid Settings > Status/Events page. User/Group Info section: The User/Group Info section is located in the Grid menu bar. Users page: Navigate to the Users page by clicking Users under the User/Group Info section of the Grid menu bar. This page shows job information pertaining to an LSF user. User Name: The name of the LSF
11. ownership and permissions. When you receive your RTM upgrade, there should be two directories in the upgrade path: the cacti and poller directories. To upgrade the Cacti Web site and grid plugin, run the following commands: cp -rp <upgrade_path>/cacti /var/www/html; chown -R cacti:cacti /var/www/html. Then verify that your config.php remains unchanged, and refresh your browser, which should already be pointed to the RTM Web site. If the Web site does not come up, contact Platform Technical Support as soon as possible. Upgrade the MySQL database: Once you have upgraded your Cacti Web site and grid plugin, you must run a database upgrade script. As Platform RTM matures, changes are made to the RTM tables to accommodate new features. Note: As of Platform RTM 1.01, larger tables such as grid_jobs and grid_jobs_rusage do not typically undergo structural changes. If one of these larger tables were to require modification, you would be informed up front of the downtime that may be required to upgrade those tables. To upgrade your database schema, run the following commands from a shell: cd /var/www/html/cacti/plugins/grid; php -q database_upgrade.php. If you receive error messages after running the database upgrade script, contact Platform Technical Support as soon as possible. Upgrade the P
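The "verify that your config.php remains unchanged" step above can be done by snapshotting the file before the copy and comparing afterwards. This is a minimal sketch under stated assumptions: the function names snapshot_config and config_unchanged are hypothetical, and the guide does not say where config.php lives, so the paths are passed in as arguments rather than assumed.

```shell
#!/bin/sh
# Hypothetical helpers for checking a config file across an upgrade.

# Usage: snapshot_config CONFIG SNAPSHOT
# Keeps a byte-identical copy (with mode and timestamps) before the upgrade.
snapshot_config() {
    cp -p "$1" "$2"
}

# Usage: config_unchanged SNAPSHOT CONFIG
# Succeeds (exit 0) only if the two files are byte-for-byte identical.
config_unchanged() {
    cmp -s "$1" "$2"
}
```

Run snapshot_config before the cp -rp step, then config_unchanged afterwards; a non-zero exit status means the upgrade overwrote the file and the snapshot should be reviewed before the Web site is used.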
12. because the data source drops below this value, the threshold triggers the specified action. Norm threshold: If the threshold is breached and then returns to normal, the threshold triggers the specified action. In the Data Source Item page, make any further changes to your threshold configuration. Click Save to create your new threshold. Modify threshold settings: 1. Click the Console tab. 2. Under the Management section of the Console menu bar, click Thresholds. 3. Click the name of the threshold that you want to modify. 4. In the Data Source Item page, make the desired changes to your threshold configuration. The Event Triggering sections allow you to configure threshold event triggering, which specifies actions (commands, shell scripts, or host-level actions) to take if the threshold conditions are met. High threshold: If the threshold is breached because the data source exceeds this value, the threshold triggers the specified action. Low threshold: If the threshold is breached because the data source drops below this value, the threshold triggers the specified action. Norm threshold: If the threshold is breached and then returns to normal, the threshold triggers the specified action. 5. Click Save to apply your changes to the threshold configuration. Delete thresholds: Delete thresholds when you no longer need the alerts that they trigger. 1. Click the Console tab.
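The three event triggers described above (High, Low, Norm) can be illustrated with a small decision function. This is only a sketch of the logic as the excerpt states it, not RTM's implementation: the function name threshold_event and the was_breached flag are hypothetical, values are treated as integers, and the duration condition mentioned elsewhere in the guide is left out.

```shell
#!/bin/sh
# Hypothetical helper: decide which event trigger applies to one sample.
# Usage: threshold_event VALUE HIGH LOW WAS_BREACHED(yes|no)
threshold_event() {
    value=$1
    high=$2
    low=$3
    was_breached=$4
    if [ "$value" -gt "$high" ]; then
        echo high      # data source exceeds the high boundary
    elif [ "$value" -lt "$low" ]; then
        echo low       # data source dropped below the low boundary
    elif [ "$was_breached" = yes ]; then
        echo norm      # previously breached, now back inside the band
    else
        echo none      # nothing to trigger
    fi
}
```

With a high boundary of 90 and a low boundary of 10, a sample of 95 maps to the High trigger, and a later sample of 50 maps to the Norm trigger because the threshold had been breached.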
13. database. Enable automatic data archiving: Enable data archiving to save legacy job and job-related data to an archive database during scheduled server maintenance, and to archive job detail records to an archive directory or file server. 1. Click the Console tab. 2. Under the Configuration section of the Console menu bar, click Grid Settings. 3. Click the Archiving tab. 4. To enable data archiving of legacy job and job-related data, select the Enable Data Archiving box and specify the data archiving settings. 5. To enable data archiving of job detail records prior to data purging, select the Create RRD's During Job Detail Purge box and specify the path to the archive directory or file server. Restore a backed-up database via RTM Console (for RTM version 1.04): After you have completed any scheduled server maintenance, or finished upgrading your RTM version or license, you must restore the Cacti database that you previously backed up. 1. Click the Console tab. 2. Under the Configuration section of the Console menu bar, click Grid Settings. 3. Click the Maint tab. 4. Scroll down to the Database Backups section of the page and find the Database Restore option. 5. Browse to the location of your previously backed-up database, and then click Save to upload and restore the .tgz backup file. If the file is successfully restored, the message Save successful displays
14. > Status/Events. Total Hosts: The total number of hosts in this cluster. Total CPUs: The total number of CPUs in this cluster. Total Clients: The total number of clients in this cluster. Collect Freq: The configured data collection frequency. Collect Timeout: The configured data collection time-out. Job Minor Freq: The configured job minor frequency. Job Major Freq: The configured job major frequency. Job Timeout: The configured job time-out. LIM Timeout: The configured LIM time-out. Choose an action: Select one or more checkboxes for the clusters on which to perform an action (for example, Enable or Disable), then select an action and click go. License Servers page: Navigate to the License Servers page by clicking License Servers under the Grid Management section of the Console menu bar. This page shows information about RTM license servers and the pollers that collect data from them. Server Name: The defined name for the license server. Click a name to open the License Server Edit page and edit server properties, connection settings, and support information. Poller Name: The name of the poller associated with this server. Poller ID: The ID assigned to the poller. Poller Interval: The license poller interval. Vendor: The specified software vendor providing services on this license server. Department: The specified department responsible for this license server.
15. in the Cacti documentation; find it here: http://cacti.net/documentation.php. The following sections and their corresponding pages are specific to configuring how Platform RTM manages your clusters. Management (Graph Management, Graph Trees, Data Sources, Devices, and Threshold pages): These pages allow you to configure a number of settings related to monitoring your cluster using Platform RTM. Grid Management (Pollers, Clusters, License Servers, and Utilities pages): These pages allow you to add LSF clusters and perform certain database administration functions. Management section: The Management section is located in the Console menu bar. Thresholds page: Navigate to the Thresholds page by clicking Thresholds under the Management section of the Console menu bar. This page shows the configured thresholds in your cluster. A threshold triggers an alarm if your clusters, hosts, queues, or jobs meet the conditions of the threshold. Name: The name of the cluster and the threshold. Click the name to change the threshold settings. Type: The type of threshold, for example, High/Low, Baseline, and Time Based. High: The high threshold boundary value. If the current value of the monitored data source item is greater than this value for a specified duration, the threshold triggers an alert. Low: The low threshold boundary value. If the current value of the monitored data source item is lower than this value for a specified duration, the threshold triggers an a
16. support@platform.com. Document redistribution and translation: This document is protected by copyright, and you may not redistribute or translate it into another language, in part or in whole. Internal redistribution: You may only redistribute this document internally within your organization (for example, on an intranet), provided that you continue to check the Platform Web site for updates and update your version of the documentation. You may not make it available to your organization over the Internet. Trademarks: LSF is a registered trademark of Platform Computing Corporation in the United States and in other jurisdictions. ACCELERATING INTELLIGENCE, PLATFORM COMPUTING, PLATFORM SYMPHONY, PLATFORM JOBSCHEDULER, PLATFORM ENTERPRISE GRID ORCHESTRATOR, PLATFORM EGO, and the PLATFORM and PLATFORM LSF logos are trademarks of Platform Computing Corporation in the United States and in other jurisdictions. UNIX is a registered trademark of The Open Group in the United States and in other jurisdictions. Linux is the registered trademark of Linus Torvalds in the U.S. and other countries. Microsoft is either a registered trademark or a trademark of Microsoft Corporation in the United States and/or other countries. Windows is a registered trademark of Microsoft Corporation in the United States and other countries. Intel, Itanium, and Pentium are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. Other products or services mentioned in this document are identi
17. text string description for the location of the data poller, for example, Data Center. Choose an action: Select one or more checkboxes for the pollers on which to perform an action (for example, Delete), then select an action and click go. Navigate to the Cluster page by clicking Clusters under the Grid Management section of the Console menu bar. This page shows information about LSF clusters, including configured time-out thresholds and job efficiency information, and the pollers that collect data from them. Add: Add a cluster for RTM to monitor. Click to open the Cluster Edit new page and specify the properties of the cluster. Cluster Name: The defined name for the cluster. Click a name to open the Cluster Edit page and edit cluster properties, defaults, and various collection settings. ID: The ID assigned to the cluster. Poller Name: The name of the poller associated with this cluster. Collect Status: The current data collection status for this cluster. Status can be Disabled, Up, Jobs Down, Down, Diminished, Admin Down, and Maintenance. Efic Status: An indicator of job efficiency within this cluster, based on configured thresholds. Status can be OK, Recovering, Warn, Alarm, and N/A. Thresholds are set from Console > Grid Settings > Status/Events. Efic Percent: An indicator of the average efficiency of running jobs within the cluster, reported as a percentage. The minimum runtime setting can be set from Console > Grid Settings
18. the following on this page: enable polling; poller type; polling interval; cron or scheduled task interval; maximum concurrent poller processes; Spine-specific execution parameters; method used to determine host availability (None, Ping, SNMP, or both); Ping settings; failure count (number of polling intervals a host must be down before logging an error); recovery count (number of polling intervals a host must remain up before returning the host to an up status). Graph Export tab: Click the Graph Export tab to open the Cacti Settings Graph Export page and configure graph export settings. You can configure the following on this page: export method; presentation method; tree display export settings; thumbnail settings; export directory path; local scratch directory path; export schedule; FTP server options. Visual tab: Click the Visual tab to open the Cacti Settings Visual page and configure Cacti display settings. You can configure the following on this page: graph display settings; maximum data query field length; graph creation settings; data source display settings; device display settings; log management settings; RRDtool font settings; maximum number of rows to display on a single page for syslog events. Authentication tab: Click the Authentication tab to open the Cacti Settings Authentication page and configure Cacti authentication settings. You can co
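The failure count and recovery count settings above describe a small state machine, sketched here in shell. This is an illustration of the behavior as the excerpt words it, not Cacti's poller code: the function name host_status is hypothetical, and the sketch assumes the counts refer to consecutive polling intervals.

```shell
#!/bin/sh
# Hypothetical helper: replay a series of poll results and print the final status.
# Usage: host_status FAILURE_COUNT RECOVERY_COUNT RESULT...
# Each RESULT is "up" or "down" for one polling interval.
host_status() {
    failure_count=$1
    recovery_count=$2
    shift 2
    status=up
    fails=0
    oks=0
    for result in "$@"; do
        if [ "$result" = down ]; then
            fails=$((fails + 1))
            oks=0
            # Flag the host only after enough consecutive failed intervals.
            if [ "$fails" -ge "$failure_count" ]; then
                status=down
            fi
        else
            oks=$((oks + 1))
            fails=0
            # Clear the flag only after enough consecutive good intervals.
            if [ "$status" = down ] && [ "$oks" -ge "$recovery_count" ]; then
                status=up
            fi
        fi
    done
    echo "$status"
}
```

With a failure count of 2 and a recovery count of 3, a single missed poll leaves the host up, two in a row mark it down, and it takes three good polls in a row to bring it back. Larger counts trade slower detection for fewer false alarms on a flaky network.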
19. to make it easy for them to follow up on the alert. Tags for threshold names are enclosed by pipe characters (| |), while tags for alert email templates are enclosed by angle brackets (< >). Not all placeholders are available for threshold names; some placeholders are only available for alert email templates. The following is a list of the placeholders available for your thresholds, given as tag for threshold name, tag for alert email template, and description:
|clusterid|, <clusterid>: The ID of the cluster.
|cluster_name|, <cluster_name>: The name of the cluster.
|cluster_lsfmaster|, <cluster_lsfmaster>: The name of the LSF master host for the cluster.
|cluster_version|, <cluster_version>: The version of LSF running in the cluster.
|cluster_limport|, <cluster_limport>: The port number of LIM running in LSF on the master host.
|custom custom_field_name|, <custom custom_field_name>: The custom data value from the custom data source that is linked in this alert. For example: custom percent, custom status.
|host_hostname|, <host_hostname>: The host name of the device linked in this alert.
|host_description|, <host_description>: The host description of the device linked in this alert.
Threshold description: Not available for threshold names, <DESCRIPTION>: The threshold description.
descriptio
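To show how pipe-delimited tags in a threshold name resolve, here is a minimal sketch that substitutes two of the tags listed above using sed. It only imitates the substitution for illustration and is not how RTM expands tags internally: the function name render_threshold_name is hypothetical, and the sample values must not contain the characters # or &, which sed would treat specially here.

```shell
#!/bin/sh
# Hypothetical helper: expand |cluster_name| and |host_hostname| in a template.
# Usage: render_threshold_name TEMPLATE CLUSTER_NAME HOST_HOSTNAME
render_threshold_name() {
    template=$1
    cluster_name=$2
    host_hostname=$3
    # '#' is used as the sed delimiter; a bare '|' is a literal character
    # in a basic regular expression, so the tags match as written.
    printf '%s\n' "$template" |
        sed -e "s#|cluster_name|#$cluster_name#g" \
            -e "s#|host_hostname|#$host_hostname#g"
}
```

For example, the template "|cluster_name| - |host_hostname| load" with the made-up values prod01 and nodeA yields a threshold name that identifies both the cluster and the host at a glance.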
20. which the host group belongs. Status: The current status of the host group. AvgRq 15 sec: The exponentially averaged effective CPU run queue length for hosts within the group for the past 15 seconds. AvgRq 1 min: The exponentially averaged effective CPU run queue length for hosts within the group for the past 1 minute. AvgRq 15 min: The exponentially averaged effective CPU run queue length for hosts within the group for the past 15 minutes. Avg CPU: The average CPU utilization for hosts within the group. Avg Page Rate: The average memory paging rate, exponentially averaged over the last minute, for hosts within the group, in pages per second. Avg I/O Rate: The average disk I/O rate, exponentially averaged over the last minute, for hosts within the group, in kilobytes per second. Total Logins: The total number of current login users for hosts within the group. Avg Idle Time: On UNIX, the idle time of the host, in minutes. On Windows, the time a screen saver has been active on the host. The average idle time is computed for all hosts within the group. Avg Temp Avail: The average amount of free space in /tmp (G = gigabyte, M = megabyte). Avg Swap Avail: The average amount of swap space available for hosts in the group (G = gigabyte, M = megabyte). Avg Mem Avail: The average maximum amount of physical memory available for user processes on the host (G = gigabyte, M = megabyte). Host Info section: The H
21. For example, if you installed RTM to /opt/rtm, open a shell and perform the following steps: tar zxvf cacti_db_backup_<num>.tgz cacti_backup/rtm/etc/rtm.lic; cp cacti_backup/rtm/etc/rtm.lic /opt/rtm/etc/rtm.lic; service mgrd restart. Upgrade the Cacti Web site and grid plugin: To upgrade the Cacti Web site, you need to pay particular attention to a few points. 1. You must make a backup copy of your <path_to_cacti>/include/global.php file. This file is important because it contains your database connect string and your list of installed plugins. 2. You must know whether you have modified the default RTM poller scripts. If you have, you need to make a separate backup of your <path_to_cacti>/scripts directory. If you have not added scripts beyond the standard RTM deployment, you do not need to back up this directory. 3. If you have modified default RTM Data Queries, you need to back up your <path_to_cacti>/resource directories. Again, if you have not modified the default installation of RTM Data Queries, you do not need to proceed with this step. Once you have done this, note the ownership of all files in RTM. The RTM installer recommends that the Apache site be owned by the cacti user account and that the Apache binaries be run by the same account. If this is the case, then all files in the /var/www/html/cacti directory and its subdirectories should be owned by cacti:cacti. Regardless of file ownership, note the
22. Info page by clicking the Admin tab, then the License tab. The first time you log on to the RTM Console, you must provide licensing information from this page. Use this page if your license expires and you need to update it, or if you wish to upgrade your demo license to a full-feature version. You can either browse to the location of your license file, or copy and paste the text from your license file into the appropriate field on this page. Click Save to complete the license update. Configuring Thresholds and Alerts. Create a threshold to trigger alerts: Create a threshold to monitor your cluster and trigger alerts. Create a threshold from a graph template: Create a threshold using a graph template as the source. 1. Click the Console tab. 2. Under the Management section of the Console menu bar, click Thresholds. 3. Click Add on the top right side of the Clusters page. 4. In the Source field, select Graph Template. 5. Select the appropriate host name and graph template for the new threshold and click Create. 6. Specify the threshold values for which you want to trigger an alert and click Create. 7. In the Data Source Item page, make any further changes to your threshold configuration. The Event Triggering sections allow you to configure threshold event triggering, which specifies actions (commands, shell scripts, or host-level actions) to take if the threshold conditions are met. High threshold: If the threshold is breac
23. 2. Under the Management section of the Console menu bar, click Thresholds. 3. Click the checkbox at the right side of each threshold that you want to delete. 4. In the Choose an action field, select Delete and click Go. Monitor alerts: 1. Click the Thold tab. If there are several thresholds, you can use the Threshold Status menu bar to filter the threshold view. 2. Click the name of a threshold with triggered alerts to see a list of the hosts and the specific data source values that triggered the alert. Acknowledge alerts for a single threshold: Acknowledge the triggered alerts for a single threshold to prevent future email and syslog notifications. 1. Click the Thold tab. If there are several thresholds, you can use the Threshold Status menu bar to filter the threshold view. 2. In the Actions column, click the acknowledge icon next to the name of the threshold with the triggered alerts. The acknowledge icon changes into the reset acknowledge icon. You can click the reset acknowledge icon next to the threshold to allow the threshold to resend future email and syslog notifications with each triggered alert for the threshold. Acknowledge alerts for multiple thresholds: Acknowledge the triggered alerts for multiple thresholds to prevent future email and syslog notifications. 1. Click the Console tab. If there are several thresholds, you can use the Threshold Status menu bar to filter the threshold view. 2. Under the Management secti
24. Interface component: Description
Graphs tab: Opens the Graph page. View graphs to which your Platform RTM Administrator has given you access.
Thold tab: Opens the Thresholds page. View information about the configured thresholds in your cluster.
Grid tab: Opens the Grid page. View information about your LSF cluster and submitted jobs.
Syslogs tab: Opens the Syslogs page. View entries from the UNIX log files located in the /var/log directory in each host in the clusters that RTM monitors.
Admin tab: Opens the Admin page. Update RTM licenses from here, along with date and time settings.
Settings tab: Allows you to customize either the layout of your graphs or the on-screen presentation of your grid. At times there are tabbed options visible to the right of the Settings tab. These change depending upon which page is opened. For example, if you are on the Graphs page, tab options are available to allow you to switch between graph views (tree, list, or preview). You may not see the Settings tab at all if the Cacti administrator has restricted your access.
Navigation bar: This is the area just below the tabs where you can find navigational breadcrumbs. The circular button on the left allows you to hide and show the grid menu bar (described in the next section). The area to its right allows you to easily navigate up a menu level when you are inside
Logout link:
25. Yes to run the grid control command.

Performance and Maintenance

Database Maintenance

Upgrading maintenance overview

Backup from the RTM Console
RTM versions 1.04 and later allow you to back up and restore your configuration within the RTM Console. The following files are backed up:
rtm.lic: RTM license file. This file is not restored automatically when you restore your configuration within the RTM Console.
lsfpollerd.conf: Database file containing the credentials.
lsf.conf: The lsf.conf file associated with each cluster.
ego.conf (for LSF 7.x clusters only): The ego.conf file associated with each cluster.
All tables in the Cacti database are backed up except for the following:
- grid_jobs
- grid_jobs_rusage
- grid_job_interval_stats
- poller_output
- poller_output_boost

Manual backup
Prior to RTM version 1.04, the upgrade process must be followed manually. Manual steps are found at the end of this appendix and include the following:
1. Back up the web site.
2. Back up the MySQL database.
3. Back up the Platform RTM poller.
4. Back up the RTM license file.
5. Upgrade the Cacti web site and grid plugin.
6. Upgrade the MySQL database.
7. Upgrade the Platform RTM poller.

Backup existing database via RTM Console (for RTM version 1.04)
Prior to upgrading to a newer version of RTM or to a fully licensed version, you must first back up the existing database. Backing up your data
26. Performance and Maintenance
  Database Maintenance
  Cluster Administration
  Issues

About Platform RTM

Introduction to Cacti and Platform RTM

About this guide
Platform RTM caters to three user groups, each responsible for HPC capacity management and planning: LSF administrators, LSF users, and IT managers. This guide focuses on LSF administrators and LSF users. Platform also provides RTM download, installation, and release information on my.platform.com. For information specific to Cacti itself, see the Cacti documentation at http://cacti.net/documentation.php.

About RTM and its interaction with Cacti
Platform RTM provides a rich graphical view of LSF clusters that, to date, has not been possible with any other product. RTM communicates the overall health of multiple LSF clusters as well as data about past cluster performance. RTM is based on Cacti, a widely popular open source product. Cacti was developed as an open source tool to provide IT administrators a way to graph
27. actor records from the tables and allow for re-creation of completion factor information based upon a change in settings.
Manage Grid Hosts: Click to open the Manage Hosts page and selectively remove client records from the host database.
Backup Files: Click a file name to download the backup file.

Configuration section

Grid Settings page
Navigate to the Grid Settings page by clicking Grid Settings under the Configuration section of the Console menu bar. There is a tab for each category of grid settings that you can change. Contact Platform Computing for assistance in determining the optimal settings for your clusters.

General tab
Click the General tab to open the Grid Settings General page and configure the default user settings for your cluster. You can configure the following on this page:
- Domain names to strip from the display. This makes the name output shorter for hosts belonging to the specified domain names. It allows you to conserve display space if you are displaying hosts in common domains within your organization.
- Summary hostnames to substitute.
- Minimum user screen refresh interval. Restrict the minimum refresh interval that your users can set to reduce load on the system. You should set a higher refresh interval for larger clusters to reduce system load.
- Maximum job zoom time range. Restrict the maximum job time in which your users can zoom in, after whic
28. added the RTM host successfully to the LSF cluster.
a. Log in to the RTM host.
b. From the RTM host, use telnet to log in to the LSF LIM port of your RTM host. The default LIM port is 6879 for LSF 6.2 clusters and 7869 for LSF 7.x clusters. For example, for LSF 7.x clusters:
telnet 7869
If you connect to the IP address of the LSF master host, you added the RTM host successfully.

Add all hosts in the cluster to RTM as devices
Use the grid_add_cluster.php script to add all hosts in the cluster list to the device list in RTM.
1. In the command line, navigate to the plugins/grid subdirectory of the Cacti installation directory. For example:
cd /opt/cacti/plugins/grid
2. Use php to run the grid_add_cluster.php script to add all hosts in the cluster list to RTM as devices and create related graphs using the Grid Host template:
php -q grid_add_cluster.php --type=1 --clusterid=all --template=14
You can view these hosts in the device list by clicking Devices in the Management section of the Console menu bar.

Issues

Issues to consider

LSF ports that RTM requires
RTM does not need access to the sbatchd (slave batch daemon) and RES (remote execution server) ports, as it does not need to communicate with these LSF components. RTM requires access to the LIM (load information manager) and mbatchd (master batch daemon) ports. If you do not specify these ports, RTM will not be able to communicate with the LSF cluster. The default LIM port is 6879 fo
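The telnet check described in this chunk only confirms that a TCP connection to the LIM port can be opened. A minimal Python sketch of the same idea follows. The function name `lim_port_reachable` and the `DEFAULT_LIM_PORTS` dictionary are invented for this illustration and are not part of RTM or LSF; the port numbers are the defaults quoted in the guide.

```python
import socket

# Default LIM ports quoted in the guide. The dictionary keys are labels
# chosen for this sketch only.
DEFAULT_LIM_PORTS = {"6.2": 6879, "7.x": 7869}


def lim_port_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution errors.
        return False
```

For an LSF 7.x cluster you might call `lim_port_reachable(your_host, DEFAULT_LIM_PORTS["7.x"])`, where `your_host` is whichever host name you would have passed to telnet. A successful connection shows only that the port is open, not that the peer is a LIM daemon.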
29. Icon actions:
- View active jobs
- View graph or source properties
- Export data to CSV format (open or save to file)
- Return to top of page
- View license checkouts
- Connect to remote host (opens a console window)
- Create threshold
- Edit threshold
- Disable threshold
- Enable threshold
- Reset acknowledged threshold (resume triggering alerts)
- Acknowledge threshold (stop triggering alerts)
- Toggle threshold rules
- Add a syslog alert rule
- Add a syslog removal rule

Page layout preferences and customizations
Depending on the configurations made by the RTM Administrator, you may or may not be allowed to modify personal grid settings. These settings control a user's default environment and interface display. From the Settings page, you can control graph colors, change the default number of rows to display in any section, or show or hide certain fields within the interface. The Settings page provides access to numerous other tabbed pages where you can configure your display preferences in detail.

Monitoring the Cluster

Configuring Cluster Management (Console Menu Tab)

Console menu bar

Cluster Management

Overview
The Console menu bar has eight sections: Create graphs, Management, Grid Management, Collection Methods, Templates, Import/Export, Configuration, and Utilities. Most are default Cacti utilities and features and are documented
30. aph embedded into the email
Threshold date: Not available. <DATE_RFC822>. The threshold date in RFC 822 format. For example, iw Ot Jen 2008 Wis O 0100

Grid Management section
The Grid Management section is located in the Console menu bar.

Pollers page
Navigate to the Pollers page by clicking Pollers under the Grid Management section of the Console menu bar. This page shows information about RTM pollers. These pollers collect information from the LSF cluster. RTM uses this data to build various graphs and reports for RTM users and administrators.
Add: Add a new poller. Click to open the RTM Poller Edit (new) page and specify the properties of the new poller.
Poller Name: The defined name for the poller. Click a name to open the RTM Poller Edit page and edit poller properties, for example, the poller name, LSF version, bin directory location, poller location, and support information.
Poller ID: The ID assigned to the poller.
LSF Version: The LSF version running on the associated cluster.
License Threads: The number of license threads that the poller uses for data collection. Data collection is faster if you specify more license threads.
Physical Location: The physical directory location of the local RTM poller, for example, /opt/rtm/lsf62/bin. If the directory is found and verified, the message OK DIR FOUND appears below this field.
Support Information: Enter a

Clusters page
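To show what a date of the kind carried by the <DATE_RFC822> placeholder looks like, here is a minimal Python sketch that uses the standard library's `email.utils.format_datetime`. The helper name `rfc822_date` is invented for this illustration and is not how RTM generates the value. The standard library writes a four-digit year, as later revisions of the mail date format do.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime


def rfc822_date(moment):
    """Format a timezone-aware datetime as an RFC 822 style date string."""
    if moment.tzinfo is None:
        # Without an offset the output would end in "-0000", which hides
        # the local time zone, so this sketch rejects naive values.
        raise ValueError("moment must be timezone-aware")
    return format_datetime(moment)


# Illustrative input only: 10:00 on 1 January 2008 in a UTC+01:00 zone.
sample = datetime(2008, 1, 1, 10, 0, 0, tzinfo=timezone(timedelta(hours=1)))
print(rfc822_date(sample))
```

The string has the shape weekday, day, month, year, time, and numeric UTC offset.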
31. base is also recommended during scheduled server maintenance.
1. Click the Console tab.
2. Under the Configuration section of the console menu bar, click Grid Settings.
3. Click the Maint tab.
4. Scroll down to the Database Backups section of the page and ensure that the following options are set:
Backup Cacti Database: Check this box to ensure the Cacti database is backed up when the maintenance script runs.
Database Backup Location: Provide a location if a backup directory does not yet exist. If the directory is found (exists), the message OK DIR FOUND displays under the directory field.
5. Under the Grid Management section of the console menu bar, click Utilities.
6. In the Database Administration section of the Utilities page, click Force Cacti Backup. The Backup Files table at the bottom of the page updates with the newly created backup file, modification date, and file size.
7. Click the backup file name to download it to a specified location.
8. Once downloaded, verify that the .tgz file contains the following files:
- cacti_db_backup.sql
- cacti_db_struct_backup.sql
- rtm/etc/rtm.lic
- rtm/etc/lsfpollerd.conf
- rtm/etc/cluster_id/lsf.conf
- rtm/etc/cluster_id/ego.conf (for LSF 7.x clusters only)
Once all files are successfully verified and backed up, you can upgrade to a new RTM version or perform server maintenance without fear of losing or corrupting your existing
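The verification in step 8 can be scripted. The following Python sketch lists the members of a downloaded .tgz archive and reports which expected file names are absent. The function name `missing_backup_files` is invented for this illustration. Matching on the base file name is an assumption made here because the cluster-specific files sit under a per-cluster directory, and ego.conf is left out of the default list because it applies to LSF 7.x clusters only.

```python
import posixpath
import tarfile

# Base names taken from the verification step in the guide.
REQUIRED_NAMES = (
    "cacti_db_backup.sql",
    "cacti_db_struct_backup.sql",
    "rtm.lic",
    "lsfpollerd.conf",
    "lsf.conf",
)


def missing_backup_files(archive_path, required=REQUIRED_NAMES):
    """Return the required base names that are absent from the archive."""
    with tarfile.open(archive_path, "r:gz") as archive:
        # Tar member names use forward slashes, hence posixpath.
        present = {posixpath.basename(name) for name in archive.getnames()}
    return [name for name in required if name not in present]
```

An empty list means every expected name was found. For an LSF 7.x cluster you could pass `required=REQUIRED_NAMES + ("ego.conf",)`.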
32. ch field allows you to specify a free-format search string. The search only looks for key data fields that cannot easily be found using the drop-down filters. Use this field to select key fields in the data you are currently viewing. For example, on the Job Info > By Host page, use the Search field to filter the Host Name field. On the Job details page, use the Search field criteria to filter the job ID and name.

Time span selection bar
On certain pages you can select a time span to view graphs and completed job details for a selected time range (for example, Grid > Job Info > By Group Array). The Presets field allows the selection of data between common time intervals such as the last day, hour, or week. Calendar links beside the To and From fields let you define custom time and date ranges. The arrows on the right side of the Time Span Selection Bar allow you to either advance or go back by the amount of time specified in the corresponding drop-down list. For example, if you are currently looking at jobs that finished in the last day and you click the left arrow on the time shifter, the jobs that completed during the previous 24 hours now display. Alternatively, if you select 1 Week from the drop-down list and then click the left arrow, jobs over a 24 hour pe
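The time shifter behavior described here (the window keeps its length while both ends move by the selected amount) can be sketched in Python as follows. The function `shift_window` and its `direction` values are invented for this illustration and do not describe RTM's internal code.

```python
from datetime import datetime, timedelta


def shift_window(start, end, shift, direction):
    """Move a (start, end) time window by `shift`, keeping its length.

    direction is "back" (left arrow) or "forward" (right arrow).
    """
    if direction == "back":
        delta = -shift
    elif direction == "forward":
        delta = shift
    else:
        raise ValueError("direction must be 'back' or 'forward'")
    return start + delta, end + delta


# A one-day window shifted back by one day shows the previous 24 hours.
start = datetime(2008, 2, 20, 11, 59)
end = datetime(2008, 2, 21, 11, 59)
print(shift_window(start, end, timedelta(days=1), "back"))
```

With a shift of `timedelta(weeks=1)` the same 24-hour window would move a full week earlier.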
33. clusters
Cluster Name: The name of the LSF cluster.
User Name: The name of the cluster user.
Queue Name: The name of the queue.
Project Name: The name of the project.
Exec Host: The name of the execution host.
Result: The result of the last job submitted to the execution host.
Total Jobs: The total number of jobs submitted on the host.
Total Slots: The total number of slots available on the host.
Avg WTime: Average job wait time.
Total WTime: Total job wait time.
System Time: The amount of system time used by submitted jobs.
User Time: The amount of user time used by submitted jobs.
Start Date: The start of the reporting period.
End Date: The end of the reporting period.

License Details page
Navigate to the License Details page by clicking License Details under the Reports section of the Grid menu bar. This page shows license details for License Scheduler projects.
Feature Name: The license name.
Server Name: The name of the FLEXlm license server.
Version: The version number of this licensed product.
Number Expiring: The number of licenses for this product.
InUse Count: The number of licenses currently in use.
Expiration Date: The date these product licenses expire.

License Usage page
Navigate to the License Usage page by clicking License Usage under the Reports section of the Grid menu bar. This page shows license usage provided by the FLEXlm servers.
Feature Name: The lic
34. description of the host. This is the same as the host name for automatically added LSF hosts.
ID: The host or device ID.
Graphs: The number of graphs for the host.
Data Sources: The number of data sources for the host.
Status: The status of the host.
Event Count: The number of threshold-triggered alerts.
Hostname: The name of the host.
Current (ms): The current host ping time in milliseconds.
Average (ms): The average host ping time in milliseconds.
Availability: The percentage of time that the host is available.

Viewing UNIX Log File Entries (Syslogs Tab)

Syslogs tab
Navigate to the Syslogs page by clicking the Syslogs tab. This page displays entries from the UNIX log files located in the /var/log directory in each host in the clusters that RTM monitors. Each data record displayed here is an entry in one of the log files. You can create an alert rule to notify you of future log entries, or a removal rule to automatically remove log file entries, by clicking Alerts or Removals in the Rules window at the top right of the Syslogs page. This page contains the following fields:
Host: The name of the host in which the log file entry is recorded.
Date: The date of the log file entry, taken from its time stamp.
Time: The time of the log file entry, taken from its time stamp.
Message: The contents of the log file entry.
Facility: The name o
35. e presented with detailed information. Beside each graph, you can click a wrench icon to display helpful debugging information collected by an RRD tool.

List view
Click the Graphs tab, then the List view tab on the right side of the tabbed menu interface, to access the graphs in List Mode. Select one or more cluster names from the list and then click View. The selected graphs display.

Preview view
Click the Graphs tab, then the Preview view tab on the right side of the tabbed menu interface, to access the graphs in Preview Mode. From the preview page, you can filter by host to limit the number of graphs displayed.

Settings tab (from Graph page)
Navigate to the Graph Settings page by clicking the Graph tab, then by clicking the Settings tab on the right side of the tabbed interface. This page allows you to configure the appearance of your graphs and default page settings.

General subsection
Configure general graph options and display formats. You can configure the following items from this page:
- Default RRA
- Default View Mode
- Default Graph View Timespan
- Display Graph View Timespan Selector
- Default Graph View Timeshift
- Allow Graph to extend t
36. en thousand records into CSV format using your filter criteria. The following is the information that this filter displays:
- Warning and Alarm
- Efficiency: Efficiency is a measure of how well an application utilizes its stated CPU request. It is calculated by dividing the actual number of CPUs used by the requested number of CPUs. This measure requires the application to be properly integrated with LSF to report this data.
- Flapping: Flapping is a measure of job state changes. If a job changes state too often, this may indicate a problem in the pre-execution or the last execution host to which the job was submitted. Optimally, the job will change state three times: PENDING, RUNNING, FINISHED.
- Job dependencies
- Invalid job dependencies
- Exited jobs
- Exclusive jobs
- Interactive jobs

Resource type
On certain pages you can filter your view of the data by providing a resource string that conforms to the LSF bhosts -R command format. In some cases, the availability of this option is dependent upon the specified cluster name.

Search field
The Sear
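The flapping measure can be illustrated with a short Python sketch. It counts how many times a job enters a new state and flags the job when the count exceeds the three changes that the guide describes as optimal. The function names and the default threshold are assumptions made for this example. In RTM the job flapping threshold is a configurable grid setting, and its actual counting logic may differ.

```python
# The guide describes three state changes as optimal:
# PENDING, RUNNING, FINISHED.
OPTIMAL_STATE_CHANGES = 3


def count_state_changes(states):
    """Count state entries in a job history, ignoring consecutive repeats."""
    changes = 0
    previous = None
    for state in states:
        if state != previous:
            changes += 1
            previous = state
    return changes


def is_flapping(states, threshold=OPTIMAL_STATE_CHANGES):
    """Return True if the job changed state more often than the threshold."""
    return count_state_changes(states) > threshold
```

For example, a history of PENDING, RUNNING, PENDING, RUNNING, FINISHED contains five state changes and would be flagged with the default threshold.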
37. ense name.
Server Name: The name of the FLEXlm license server.
Max Licenses: The total number of available licenses.
In Use Licenses: The number of licenses in use by License Scheduler projects.

License Checkouts page
Navigate to the License Checkouts page by clicking License Checkouts under the Reports section of the Grid menu bar. This page shows information on licenses that are currently checked out.
Feature Name: The license name.
Vendor Name: The name of the license vendor.
Version: The version of the license.
User ID: The user that checked out the license.
Host Name: The name of the FLEXlm license server.
Status: The current status of the license checkout.
InUse: The number of licenses checked out.
Date: The time of the last change in status.

Parameters page
Navigate to the Parameters page by clicking Parameters under the Reports section of the Grid menu bar. This page shows defined configuration parameters for your cluster, as defined in the lsb.params file.
Name: Name of the parameter used in this cluster.
Cluster: Name of the cluster to which the parameter applies.
Description: Brief description of the parameter.
Value: Configured parameter value.

Settings tab

General tab
Visual tab
Navigate to the Grid Sett
38. es tab to open the Grid Settings Queues page and show or hide certain fields on the Job Info > By Queue page.

Jobs tab
Click the Jobs tab to open the Grid Settings Jobs page and show or hide certain fields on the Job Info > Details page. As there is a large quantity of data collected on LSF jobs, you will likely want to show only the information that you are most interested in.

Job Export tab
Click the Job Export tab to open the Grid Settings Job Export page and show or hide certain fields on the Job Info > By Host and Job Info > By Host Group pages. As there is a large quantity of data collected on LSF jobs, you will likely want to show only the information that you are most interested in.

Arrays tab
Click the Arrays tab to open the Grid Settings Arrays page and customize your view of the Job Info > By Array page. Choose those fields that are most relevant for your environment.

Job Graphs tab
Click the Job Graphs tab to open the Grid Settings Job Graphs page and customize the display of your Job > Details graph according to your preference. This page contains the following fields:
Graph Columns: The number of columns on the page that RTM uses to display a job's graphs. Choose either 1 or 2 columns.
Width: The width of a job's graphs in pixels.
Height: The height of a job's graphs in pixels.
You may also customize the colors that RTM uses when constructing j
39. essful polling reported as a percentage.
Last Failed: The last time a license failed to check out.

Cluster Details page
Select the Details box in the Cluster dashboard to see a summary of status details of your clusters. This page shows the following status information, represented by icons.

Table 1. LIM Status (Icon: Status)
- ok
- locked, locked user, locked window, locked master
- busy
- unavail
- unlicensed
- sbatchd is down
- RES is down

Table 2. Batch Status (Icon: Status)
- Any closed except for admin
- closed admin
- unavail
- unlicensed
- unreach

For a description of the RTM Status icons, open the Host Dashboard (select Host in the Dashboards section of the Grid menu bar) and view the Host Status Legend.

Job Info section
The Job Info section is located in the Grid menu bar.

By Host page
Navigate to the By Host page by clicking By Host under the Job Info section of the Grid menu bar. This page shows information about hosts in a cluster.
Host Name: The name of the host. Click a host name to show running jobs for this host on the Job Info > Details page.
Cluster: The LSF cluster to which this host belongs.
Type: The type of host, as defined in the LSF configuration.
Model: The model of the host, as defined in the LSF configuration.
Load/Batch: The current Load and Batch status of the host.
CPU Fac
40. f the system log or service that recorded the log file entry.
Level: The error level of the log file entry.
Options: Actions that you can perform on the log file entry. You can either create a removal rule or create an alert rule based on this entry.

Viewing Cluster Service Performance and Host Information (Graphs Tab)

Viewing modes
Click the Graphs tab to view a graphical representation of your cluster status and details. You can view graphs using the Tree, List, or Preview views.

Tree view
Click the Graphs tab, then the Tree view tab on the right side of the tabbed menu interface, to access the graphs in Tree Mode. Use this viewing mode to access all graphs as organized by device and cluster in the tree. This tree is customizable, as are all graphs, from the Console tab. Devices can be added to the tree and modified from the Console tab.

What you can do:
- Click a root-level tree name to view summary information about all devices within that branch (view thumbnails).
- Click a tree branch to view host-specific information in graph form.
- Beside a graph, click the magnifying glass to view more detailed information, broken down in different ways. If you zoom into a graph, you ar
41. f whether you enabled them in this tab. You can configure the following on this page:
- Track and highlight jobs that change state frequently (job flapping)
- Set job flapping threshold
- Color to indicate warning state for job flapping
- Track and highlight jobs that violate the job efficiency threshold
- Set the job start window. Platform RTM only tracks jobs running for at least this period of time.
- Set the job warning threshold for entire cluster
- Color to indicate warning state for job efficiency
- Set the job alarm threshold for entire cluster
- Color to indicate alarm state for job efficiency
- Number of warning/alarm events before issuing a corresponding message
- Number of clear events prior to issuing a NOTICE event message
- Track PID levels and generate log message if threshold exceeded
- Set PID threshold
- Highlight pending jobs with dependencies
- Colors to indicate different types of jobs

Viewing LSF Cluster and Job Information (Grid Tab)

Grid menu bar

Overview
The Grid menu bar has six sections: Dashboards, Job Info, User/Group Info, Load Info, Host Info, and Reports. The following pages in these sections are specific to Platform RTM:
Dashboards: Cluster, License, and Host pages. These pages provide an overview of LSF cluster, license, and host health.
Management: By Cluster, By Host, By Queue, By Job. These pages allow you to control y
42. fied by the trademarks or service marks of their respective owners. http://www.platform.com/Company/third.part.license.htm

Contents
About Platform RTM
  Introduction to Cacti and Platform RTM
  User interface
Monitoring the Cluster
  Configuring Cluster Management (Console Menu Tab)
  Viewing LSF Cluster and Job Information (Grid Tab)
  Viewing Threshold Information (Thold Tab)
  Viewing UNIX Log File Entries (Syslogs Tab)
  Viewing Cluster Service Performance and Host Information (Graphs Tab)
Administering Platform RTM
  Configuring Cluster Interaction (Console Menu Tab)
  Configuring Date, Time, and License Information (Admin Tab)
  Configuring Thresholds and Alerts
  Controlling an LSF Cluster
43. h they cannot zoom in anymore. You should restrict this setting to reduce load, as the job zoom function is system-intensive. You should set a smaller window for larger clusters to reduce system load.
- User group filter operation. Specify how your cluster handles user group filtering:
  User Group Membership: User accounts are assigned to a user group.
  Job Specification: Jobs are assigned to user groups at job submission time by using bsub -G.
- Maximum export rows. Restrict the maximum number of rows that your users can export to increase system performance. You should set fewer rows to increase system performance.
- Enable cluster CPU factor leveling.
  Important: Do not enable this setting unless you understand how to apply CPU factoring to hosts in your cluster.

Poller tab
Click the Poller tab to open the Grid Settings Poller page and configure poller defaults for data collection interval settings and thresholds. You can configure the following on this page:
- Enable daemons

Maint tab
Click the Maint tab to open the Grid Settings Maint page and configure system maintenance settings. You can keep more data for smaller clusters because there are fewer records for these clusters. You can configure the following on this page:
- Time when past database records are removed
- Retention period for job details. Records of job details are kept for this period of time after the job is ended
44. he Host, Queue, and Job level controls using eauth in the LSF master host to invoke the control actions. After saving these settings, this user name is created as a disabled UNIX local account in the RTM host.
If you are connecting to the LSF master host using ssh private key authentication, you need to provide the private key path pointing to the private key file. As shown in the prerequisites, the public key of this file is added to the authorized_keys file of the LSF master host root user.
The LSF server top directory is the top-level LSF installation directory (LSF_TOP).
6. Click Save to apply your changes.

Add the RTM host to the LSF cluster as an LSF client

Run grid control commands on an LSF cluster
1. Click the Grid tab.
2. Under the Management section of the Grid menu bar, click the link corresponding with the type of grid control commands you want to run:
By Cluster: Control cluster-level components such as mbatchd, LIM, or RES.
By Host: Open or close hosts in the LSF cluster.
By Queue: Control queues in the LSF cluster.
By Job: Control jobs that are submitted to the LSF cluster.
3. Select the checkbox next to at least one item in the list for which you want to run the grid control command.
4. In the Choose an action field, select an action and click Go.
5. If the grid control requires additional information, specify this information in the displayed fields.
6. Click
45. hed because the data source exceeds this value, the threshold triggers the specified action.
Low threshold: If the threshold is breached because the data source drops below this value, the threshold triggers the specified action.
Norm threshold: If the threshold is breached and then returns to normal, the threshold triggers the specified action.
8. Click Save to create your new threshold.

Create a threshold from a host
Create a threshold using a host as the source.
1. Click the Console tab.
2. Under the Management section of the Console menu bar, click Thresholds.
3. Click Add on the top right side of the Clusters page.
4. In the Source field, select Host.
5. Select the appropriate host name for the new threshold.
6. In the Graph field, specify the graph that you want your threshold to monitor. The Data Source field displays, followed by the graph that you specified.
7. In the Data Source field, specify the data source item that you want your threshold to monitor and click Create.
The Event Triggering sections allow you to configure threshold event triggering, which specifies the actions (commands, shell scripts, or host-level actions) to take if the threshold conditions are met.
High threshold: If the threshold is breached because the data source exceeds this value, the threshold triggers the specified action.
Low threshold: If the threshold is breached
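The high, low, and norm triggers described in this chunk can be sketched as a small decision function. This is an illustrative Python model only. The function name, its parameters, and the returned labels are invented here and do not represent the threshold plugin's code.

```python
def evaluate_threshold(value, high=None, low=None, was_breached=False):
    """Classify one data source sample against a threshold.

    Returns "high" if the value exceeds the high limit, "low" if it drops
    below the low limit, "norm" if a previously breached threshold is back
    within limits, and None when nothing needs to be triggered.
    """
    if high is not None and value > high:
        return "high"
    if low is not None and value < low:
        return "low"
    if was_breached:
        return "norm"
    return None
```

A caller would keep `was_breached` from the previous sample, so that the "norm" action fires once when the data source returns to its normal range.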
46. hold settings, including thresholds to identify when resources are idle, closed, low, busy, or starved. You can configure the following on this page:
- When to consider a host idle or closed based on job slots
- When to consider a host idle with jobs based on CPU percentage
- When to consider a host low on resources based on load average
- When to consider a host's low physical memory urgent
- When to consider a host's low swap memory urgent
- When to consider a host's low temp memory urgent
- The point at which an IO rate is high enough to consider a host low on resources (Kb/sec)
- The point at which a paging rate is high enough to consider a host low on resources (pages/sec)
- When to consider a host busy based on CPU
- When to consider a host busy based on the load average
- The point at which an IO rate is high enough to consider a host busy (Kb/sec)
- The point at which a paging rate is high enough to consider a host busy (pages/sec)
- When to consider a host idle based on comparisons of load vs. running jobs (shows if a host has orphaned or non-CPU-intensive jobs running)
- When to consider a host idle based on a comparison of load vs. running jobs (shows if a host may be running jobs outside the grid management system)
- When to consider a host starved

Agg
47. hosts on those clusters. If you choose to filter the display, the display changes to reflect the current filtering. If you roll your mouse over a host, summary information displays about that host. For example, you can view load averages, numbers of job slots and current slot utilization, administrative notes, and status. If you click a host icon, you are directed to the RUNNING jobs for that host on the Job Info > Details page. Color coding for the host icons is described under the Host Status Legend section. The host icons can appear as either small or large in size. Click the Settings tab and modify the settings found under the Visual sub-tab to control this behavior.
The cluster page shows the following information:
Cluster Name: The LSF cluster name.
Cluster Status: The status of the cluster.
Master Status: The status of the master host in the cluster.
PAU: The type of the host currently controlling the cluster. Valid values are as follows:
  P: Primary master host
  A: Failover host
  U: Unknown host type
Collect Status: The data collection status for the cluster.
CPU %: The cluster's overall CPU utilization rate as a percentage.
Slot %: The entire cluster's slot utilization as a percentage.
Effic %: The entire cluster's CPU efficiency for running jobs. Efficiency is calculated with this formula: cpu_time / (run_time x #_of_cpus)
Total CPUs: The total number of CPUs in the cluster.
H
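As a worked example of the efficiency formula, cpu_time / (run_time x #_of_cpus), here is a minimal Python sketch. The function name and the choice to return a plain ratio are assumptions made for this illustration. Multiply the result by 100 if you want a percentage.

```python
def job_efficiency(cpu_time, run_time, num_cpus):
    """Return CPU efficiency as the ratio cpu_time / (run_time * num_cpus).

    cpu_time and run_time must use the same unit (for example, seconds).
    """
    if run_time <= 0 or num_cpus <= 0:
        raise ValueError("run_time and num_cpus must be positive")
    return cpu_time / (run_time * num_cpus)


# A job that used 1800 CPU seconds over 3600 seconds on one CPU
# has an efficiency ratio of 0.5.
print(job_efficiency(1800, 3600, 1))
```

A job that accumulates 7200 CPU seconds over 3600 seconds on four CPUs has the same ratio, because only half of the requested CPU capacity was used.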
48. ically view the status of devices and services within their infrastructure. In recent years, with the release of the Cacti Plug-in Architecture, organizations using Cacti can extend the Cacti framework to address other needs. Platform RTM is one such add-on to Cacti. RTM lets users view information about their LSF grids graphically and includes a near-real-time reporting interface.
Out of the box, Cacti can monitor UPS devices, servers, services, databases, network switches, SANs, and NASs. In addition, Cacti can record any time series data that can be obtained through either SNMP or a script. Using this mechanism, Platform RTM provides the ability to view LSF data such as execution hosts, users, queues, and job statistics. Together, Platform RTM and Cacti give IT organizations an opportunity to consolidate monitoring of an entire HPC computing infrastructure.
Relationship between LSF, RTM, and FlexLM server
RTM is used to monitor and graph LSF resources (including networks, disks, applications, etc.) in a cluster. In graph or table formats, RTM displays resource-related information such as the number of jobs submitted, the details of individual jobs (like load average, CPU usage, and job owner), or the hosts on which the jobs ran.
FlexLM is a third-party license manager used by Platform RTM for license control. FlexLM allows licenses to reside on the network instead of on a specific host; this allows any registered host i
49. in the event of a breach.
High threshold: If the threshold is breached because the data source exceeds this value, the threshold triggers the specified action on the host.
Low threshold: If the threshold is breached because the data source drops below this value, the threshold triggers the specified action on the host.
Email message body: Email alert message content. This specifies the template that is used in alert email notifications for this threshold.
Note: You can use placeholders to customize your alert emails and provide additional information. Placeholders for the email message body are enclosed by angle brackets (<>), for example, <cluster_name>.
Syslog settings
Datatype: Special formatting for the given data.
Realert cycle: The amount of time after which the threshold repeats the alert if it is still in breach.
Notify accounts and extra alert emails: Email addresses to be notified when the threshold raises an alert.
Placeholder tags
Placeholder name: Cluster ID, Cluster name, Cluster LSF master, Cluster LSF version, Cluster LSF LIM port, Custom data value, Host name, Host description
Placeholders are custom tags that represent real system values. You can insert placeholders in threshold names to show customized names based on your system, and you can insert placeholders in alert email templates to present additional information for administrators.
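To make the placeholder mechanism concrete, the following is a minimal Python sketch of how angle-bracket placeholders in an email template could be replaced with real values. It is an illustration only and not RTM's implementation; the function name, the regular expression, and the choice to leave unknown tags untouched are all assumptions.

```python
import re


def render_alert_body(template: str, values: dict) -> str:
    """Replace <tag> placeholders in an alert email template (hypothetical helper).

    Tags without a matching entry in `values` are left unchanged so that
    a typo in a template stays visible in the resulting email.
    """
    def substitute(match: re.Match) -> str:
        name = match.group(1)
        return str(values.get(name, match.group(0)))

    # Match tags made of letters and underscores, such as <cluster_name>.
    return re.sub(r"<([A-Za-z_]+)>", substitute, template)
```

Calling render_alert_body("Cluster <cluster_name> is in breach", {"cluster_name": "maincluster"}) would produce the text with the cluster name filled in.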
50. Client Name: The host name of the client.
Cluster: The cluster to which this client belongs.
First seen: The date and time at which this client was first seen in this cluster.
Last seen: The date and time at which this client was last seen in this cluster.
This page is very helpful in situations where you are using either submission-only clients or floating client configurations.
Groups page
Navigate to the Groups page by clicking Groups under the Host Info section of the Grid menu bar. This page shows basic host information for each host within a host group.
Group Name: The name of the host group.
Cluster: The LSF cluster to which this host group belongs.
Host Name: The host name of a host belonging to the host group.
Host Type: The type of the host.
Host Model: The model of the host.
CPU Factor: The CPU factor of the host.
Max CPUs: The number of processors on the host.
Max Mem: The maximum amount of physical memory available for user processes on the host (G: gigabytes, M: megabytes).
Max Swap: The total available swap space on the host (G: gigabytes, M: megabytes).
Max Tmp: The maximum amount of space in /tmp for the host (G: gigabytes, M: megabytes).
Reports section
The Reports section is located in the Grid menu bar.
Daily Statistics page
Navigate to the Daily Statistics page by clicking Daily Statistics under the Reports section of the Grid menu bar. This page shows daily statistics for your
51. ings tab by clicking the Grid tab, then clicking the Settings tab on the right side of the tabbed interface. There is a tab for each category of grid display settings that you can change.
Click the General tab to open the Grid Settings General page and define the default page to display, as well as the default LSF cluster for which you want to filter data. The fields on this page are as follows:
Your Main Screen: The default information page that opens when you click the Grid tab.
Default Cluster: The default cluster name used in the filter on all RTM pages.
Default Cluster Timezone: Whether to display job event times using the time zone of the cluster or of the RTM server.
Default Job Status: The default setting used to initially filter the list of displayed jobs on the Grid page within Job Info > Details.
All Job Status: Allows the user to choose the ALL option to show a list of all jobs on the Grid page within Job Info > Details.
Show Inactive Users: Preference to show all LSF users, even if they have not run any jobs recently (normally they are not shown).
Audible Alerting: Preference for an audible alert to system administrators when an LSF host becomes unavailable or unreachable.
Blink When Down: Preference for a host icon to blink when that host is down.
Support Advanced Popup: Enables the display of additional popup content, including additional job links. This does not work with Internet Explorer version 7 and earlier.
52. latform RTM poller
1. Verify the permissions on all file objects within the Platform RTM poller directory. They should typically be owned by cacti:cacti. If you have chosen another method, note those file permissions.
2. Execute the following commands to upgrade the pollers:
cp -rp <upgrade_path>/poller /usr/local
chown -R cacti:cacti /usr/local/grid
Cluster Administration
Add clusters to RTM
Add any LSF clusters that you want RTM to monitor.
Add clusters to RTM using the RTM Console
Add clusters to RTM using the command line script
Add clusters to RTM using the RTM Console
Use the RTM Console to add an LSF cluster to RTM.
1. Click the Console tab.
2. Under the Grid Management section of the Console menu bar, click Clusters. The Clusters page displays.
3. Click Add on the top right side of the Clusters page. The Cluster Edit [new] page displays.
4. Specify the required fields describing your cluster. For the Grid Poller field, select the appropriate poller for your version of the LSF cluster.
5. Click Create to save the settings for your cluster.
Add the RTM host to the new cluster as an LSF client.
Add clusters to RTM using a script
Use the grid_add_cluster.php script to add an LSF cluster to RTM.
1. In the command line, navigate to the plugins/grid subdirectory of the Cacti installation directory. For example:
cd /opt/cacti/p
53. lert Trigger: The amount of time that the data source item must be in breach of the threshold before the threshold triggers an alert.
Duration: If the data source item is still in breach of the threshold, this is the amount of time since the alert was first triggered.
Repeat: The amount of time that the threshold waits before repeating the alert if the data source item is still in breach of the threshold.
Current: The current value of the monitored data field.
Triggered: Indicates whether this threshold has triggered an alert.
Enabled: Indicates whether this threshold is currently active.
Ack: Indicates whether the threshold alerts have been acknowledged. "on" indicates that the threshold has been acknowledged; "off" indicates that the threshold either has not been acknowledged or had its acknowledgement reset.
Thold Item page
Navigate to the Thold Item page for a threshold by clicking the name of the threshold on the Thresholds page. This page allows you to configure threshold settings and event triggering.
Event triggering behavior is based on re-alert cycle settings. When the threshold first triggers an alert, the event trigger is invoked based on a high or low threshold breach. If the alert stays triggered, the event trigger is invoked again unless the re-alert cycle is set to Never. When the alert reverts to normal, the threshold triggers the norm threshold command or script.
Y
54. ller logging levels
The version of the SNMP utility installed in the RTM host
The version of the RRDTool utility installed in the RTM host
SNMP default settings
Whether RTM prompts the user before deleting items
Paths tab
Click the Paths tab to open the Cacti Settings Paths page and configure Cacti directories and file paths. You can configure the following on this page:
Location of SNMP binary files on the RTM host. If the files are found and verified, the message "OK: FILE FOUND" appears below these fields.
Location of the RRDTool binary file on the RTM host (for example, /usr/bin/rrdtool). If the file is found and verified, the message "OK: FILE FOUND" appears below this field.
Location of the RRDTool font file on the RTM host. If the file is found and verified, the message "OK: FILE FOUND" appears below this field.
Location of the PHP binary file on the RTM host (for example, /usr/bin/php). If the file is found and verified, the message "OK: FILE FOUND" appears below this field.
Location of the log file on the RTM host (for example, /opt/cacti/log/cacti.log). If the file is found and verified, the message "OK: FILE FOUND" appears below this field.
Location of the Spine binary file. If the file is found and verified, the message "OK: FILE FOUND" appears below this field.
Poller tab
Click the Poller tab to open the Cacti Settings Poller page and configure poller defaults. You can configure
55. lugins/grid
2. Use php to run the grid_add_cluster.php script:
php -q grid_add_cluster.php --type=0 --pollerid=lsf_type --cluster_name=cluster_name_text --cluster_env=lsf_envdir_path
where
lsf_type is an integer representing the version of LSF running in the cluster:
  1: LSF 6.2
  2: LSF 7.0.1
  3: LSF 7.0.2
  4: LSF 7.0.3
  5: LSF 7.0.4
cluster_name_text is the name of the cluster.
lsf_envdir_path is the path to the lsf.conf file for your LSF cluster.
For example, to add an LSF 7.0.1 cluster named maincluster with lsf.conf located in /share/lsf/conf:
php -q grid_add_cluster.php --type=0 --pollerid=2 --cluster_name=maincluster --cluster_env=/share/lsf/conf
Add the RTM host to the new cluster as an LSF client.
Verify that the new cluster is added to RTM using the RTM Console by clicking Clusters in the Console menu bar and checking that the new cluster is up.
Add the RTM host to the LSF cluster as an LSF client
For any LSF cluster that RTM monitors, you need to add the RTM host to the cluster as an LSF client to give RTM access to LSF cluster data.
1. Log into the LSF master host.
2. Edit the /etc/hosts file and add the IP address and host name of your RTM host.
3. Edit the lsf.cluster.cluster_name file and add the RTM host to the Host section.
4. Reconfigure LIM and restart mbatchd to apply your changes to the cluster:
lsadmin reconfig
badmin mbdrestart
5. Test that you
56. n
Threshold host name | Not available | <HOSTNAME> | The host name of the threshold
Threshold trigger time | Not available | <TIME> | The time at which the threshold triggered this alert
Threshold graph URL | Not available | <URL> | The link to the URL of the threshold graph
Threshold current value | Not available | <CURRENTVALUE> | The current value of the data field being monitored by the threshold at the time of the alert email
Threshold name | Not available | <NAME> | The name of the threshold
Threshold data source name | Not available | <DSNAME> | The name of the data source being monitored by the threshold
Threshold type | Not available | <THOLDTYPE> | The threshold type
Threshold high value | Not available | <HI> | The high threshold boundary value
Threshold low value | Not available | <LO> | The low threshold boundary value
Threshold trigger | Not available | <TRIGGER> | The threshold trigger value
Threshold graph ID | Not available | <GRAPHID> | The ID of the threshold graph
Threshold duration | Not available | <DURATION> | The duration of the threshold
Threshold details URL | Not available | <DETAILS_URL> | A URL to the threshold details page, which is a list of hosts that breached this threshold
Threshold breached items | Not available | <BREACHED_ITEMS> | A list of items that breached this threshold, in an HTML table format
Threshold graph | Not available | <GRAPH> | The threshold gr
57. n perform the following actions to control queues in an LSF cluster: open queues, close queues, activate queues, deactivate queues, and switch all jobs from one queue to another.
Job controls
You can run the following LSF commands to control jobs in an LSF cluster:
btop: moves a pending job relative to the first job in the queue
bbot: moves a pending job relative to the last job in the queue
bswitch: switches unfinished jobs from one queue to another
brun: forces jobs to run immediately
bstop: suspends unfinished jobs
bkill: sends signals to kill unfinished jobs
bkill -r: forces a job kill
bkill -s: sends a specific signal to kill a job
Enable grid control on an LSF cluster
The LSF cluster and the RTM host must meet the following requirements:
The LSF master host is a Linux, AIX, HP-UX, or Solaris host with sh or bash installed.
The RTM host has rsh or ssh access to the LSF master host.
The LSF master host uses at least one of the following methods of authentication and meets the corresponding requirements:
ssh password authentication: You are asked for the password of the LSF master host root user each time you invoke a cluster control action.
ssh private key authentication: You created an ssh public key pair by running ssh-keygen -t rsa on the RTM host as root, then adding the public key to the authorized_keys file of the LSF master host root user. The LSF master host has passwo
58. n the cluster to access and use a specific licensed application, thereby increasing usage efficiency.
[Figure: Web-based portal, RTM server, LSF clusters, and FlexLM servers. RTM monitors multiple LSF clusters and collects information from FlexLM servers.]
Start and stop FlexLM server
Platform RTM is already configured with the FlexLM server. If you need to start or stop the FlexLM server, navigate to /opt/flexlm/bin and then run the following script:
lmgrd -l log/lmgrd.log -c /opt/rtm/etc/rtm.lic
Alternatively, copy bin/lmgrd.init to /etc/init.d/lmgrd and then use these commands to start and stop the FlexLM server:
service lmgrd start
service lmgrd stop
User interface
For the most part, Platform RTM follows the design cues of the original Cacti product. This section describes the details common to all elements of the Cacti user interface, allowing you to navigate its functionality more easily.
Tabbed interface
There is a tab for each major area of functionality within the product. The following table describes components of Cacti's default user interface, configured to include Platform RTM:
Interface component | Description
Console tab | Opens the Console page. Access Cacti and Platform RTM administration functions, including graph creation and management, templates, grid settings, and utilities.
59. nfigure the following on this page:
Authentication method
Name of the guest user for viewing graphs
Name of the user that Cacti uses as a template for new users
LDAP settings
EGO authentication settings
Alerting/Thold tab
Click the Alerting/Thold tab to open the Cacti Settings Alerting/THold page and configure alert and cluster threshold settings. You can configure the following on this page:
Disable all thresholds
Base URL of the Cacti server
Maximum number of thresholds to display per page
Enable logging of threshold failures
Enable logging of threshold changes
Default alerting options
Default baseline options
Emailing options
Misc tab
Click the Misc tab to open the Cacti Settings Misc page and configure syslog event settings. You can configure the following on this page:
Syslog page refresh interval
Syslog event retention period
Syslog event email settings
RTM Plugins tab
Click the RTM Plugins tab to open the Cacti Settings RTM Plugins page and configure RTM plugin settings. You can configure the following on this page:
DNS suffix for the RTM server. This setting can be found in the ssh or telnet configuration of the RTM host and is only required if your web browser cannot resolve host names.
ssh terminal window display settings
Boost tab
Click the Boost tab to open the Cacti Settings Boost page and configure Boost server settings.
60. o Future
First Day of the Week
Start of the Daily Shift
End of Daily Shift
Graph Date Display Format
Graph Date Separator
Page Refresh
Graph thumbnails subsection
Configure the size of the thumbnails used to represent your graphs. You can configure the following items from this page:
Thumbnail Height
Thumbnail Width
Thumbnail Column
Thumbnail Selection
Tree View Mode subsection
Configure the tree defaults and display when in tree view mode. You can configure the following items from this page:
Default Graph Tree
Default Tree View Mode
Dual Pane Tree Width
Expand Hosts
Preview Mode subsection
Configure the preview mode display. You can configure the following items from this page:
Graphs Per Page
List View Mode subsection
Configure the list view mode display. You can configure the following items from this page:
Graphs Per Page
Graph Fonts subsection
Choose whether to use your own custom fonts and font sizes or the system defaults. You can configure the following items from this page:
Use Custom Fonts
Administering Platform RTM
Configuring Cluster Interaction (Console Menu)
Console menu bar
Configuration Overview
The Console menu bar has eight sections: Create graphs, Management, Grid Management, Collection Methods, Templates, Import/Export, Configurati
61. ob graphs. The drop-down lists are color coded to facilitate your choice.
Viewing Threshold Information (Thold Tab)
Thresholds tab
Navigate to the Thresholds page by clicking the Thold tab, then the Thresholds tab. This page shows the configured thresholds in your cluster.
This page contains the following fields:
Name: The name of the cluster and the threshold. Click the name of a threshold with triggered alerts to see a list of the hosts and the specific data source values that triggered the alert.
ID: The ID assigned to the threshold.
Type: The type of threshold, for example, High/Low, Baseline, or Time Based.
High: The high threshold boundary value. If the current value of the monitored data source is greater than this boundary, the threshold triggers an alert.
Low: The low threshold boundary value. If the current value of the monitored data source is lower than this boundary, the threshold triggers an alert.
Current: The current value of the monitored data source.
Enabled: Indicates whether this threshold is currently active.
At the bottom of the Thresholds page, there are color codes that indicate threshold conditions, including Alarm, Warning Alarm, Notice, Ok, and Disabled.
Host Status tab
Navigate to the Host Status page by clicking the Thold tab, then the Host Status tab. This page shows the status of the hosts that are being monitored by a threshold.
This page contains the following fields:
Description: A
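The High and Low boundary rules described for the Thresholds page can be sketched in a few lines of Python. This is only an illustration of the comparison logic stated above (strictly greater than High, strictly lower than Low); the function name and return values are hypothetical, and the sketch ignores breach duration, re-alert cycles, and baseline or time-based threshold types.

```python
from typing import Optional


def breach_state(current: float,
                 high: Optional[float] = None,
                 low: Optional[float] = None) -> str:
    """Classify a monitored value against optional high/low boundaries.

    Returns "high" if the value exceeds the high boundary, "low" if it
    drops below the low boundary, and "ok" otherwise.
    """
    if high is not None and current > high:
        return "high"
    if low is not None and current < low:
        return "low"
    return "ok"
```

A value exactly equal to a boundary is treated as normal in this sketch, matching the "greater than" and "lower than" wording of the field descriptions.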
62. of a menu item. Click this link to log out of the system.
Menu bars
Menu bars run vertically along the left sides of the Console and Grid tabs. Use the Console menu bar to access administration tools and functions, and the Grid menu bar to view information about your LSF cluster and submitted jobs. Note that your RTM administrator may hide or show various menu items; you may not have access to all of the areas described.
Selection filtering
Within various menus, you can filter the information that you want displayed. For example, you might filter by cluster, user, status, etc.
[Figure: Batch Job Filters selection filter with User, UGroup, Status, Effic, Host, HGroup, Except, and Search fields]
Click the inverted green triangle along the title bar to hide or show the selection filter. When the filter is hidden, your viewable area increases, but filter options are not lost. Once filter options are set, the displayed information updates to include only the selected items.
Selection filters operate using the AND operator. In the example above, if you select the Status RUNNING and the User John, only John's running jobs show in the display area.
Button descriptions
Click Go to refresh the page using the current filter criteria.
Click Clear to return the filters to their default values.
Certain filters also include an Export button. Use this to export as many as t
63. on, and Utilities. Most are default Cacti utilities and features and are documented in the Cacti documentation (find it here: http://cacti.net/documentation.php).
The following section (and its corresponding pages) is specific to configuring Grid and Cacti settings:
Configuration: Cacti Settings, Grid Settings pages
These pages allow you to configure a number of settings related to how Platform RTM interacts with your cluster.
Configuration section
Grid Settings page
Navigate to the Grid Settings page by clicking Grid Settings under the Configuration section of the Console menu bar. There is a tab for each category of grid settings that you can change. Contact Platform Computing for assistance in determining the optimal settings for your clusters.
General tab
Click the General tab to open the Grid Settings General page and configure the default user settings for your cluster. You can configure the following on this page:
Domain names to strip from the display. This makes the name output shorter for hosts belonging to the specified domain names. It allows you to conserve display space if you are displaying hosts in common domains within your organization.
Summary hostnames to substitute.
Minimum user screen refresh interval. Restrict the minimum refresh interval that your users can set to reduce load on the system. You should set a higher refresh interval for larger clusters to reduce system load.
Maximum job zoom time range. Rest
64. on of the Console menu bar, click Thresholds.
3. Click the checkbox at the right side of each threshold with triggered alerts that you want to acknowledge.
4. In the Choose an action field, select Acknowledge and click Go.
5. In the Acknowledge Message window, specify an acknowledgement reason message (or leave blank for no message) and click Yes to acknowledge the triggered alerts for all thresholds. This message is recorded in the cacti.log file, the thold log database table, and syslog files.
You can repeat the above steps but select Reset Acknowledgement in the Choose an action field to allow the thresholds to resend future email and system log notifications with each triggered alert for the threshold.
Controlling an LSF Cluster
Controlling LSF clusters
Platform RTM allows you to control LSF clusters, hosts, queues, and jobs, as long as you have enabled RTM to control the LSF clusters. RTM controls the LSF clusters by invoking LSF commands in the LSF master host.
Cluster controls
You can control the following cluster-level components:
mbatchd: start, restart, or shut down
LIM: start, restart, or shut down
RES: start, restart, or shut down
You can also run the following LSF commands:
badmin reconfig: dynamically reconfigures LSF
lsadmin reconfig: restarts LIM on all hosts in the cluster
Host controls
You can open or close hosts in LSF clusters.
Queue controls
You ca
65. only backs up the file path to the key file.
Even though the contents of the tables are not backed up, their structure is. To run the database backup, you must first specify a backup location under the Console option Grid Settings > Maint. Once you have verified that, open a shell and perform the following steps:
cd /var/www/html/cacti/plugins/grid
php -q database_backup.php
Running that utility creates three backup files: the MySQL database itself, the Cacti/RTM schema, and the critical RTM tables. These backup files can be used in the case of a disaster to restore full functionality to the system.
Back up the Platform RTM poller
To support the ever-growing list of features and to provide bug fixes, the Platform RTM pollers have to be upgraded from time to time. Therefore, prior to upgrading Platform RTM, you should first make a backup copy of your Platform RTM pollers.
To back up your pollers, execute the following commands on each poller, or simply back up one poller and distribute the changes to all other pollers:
cd /usr/local
tar -cf grid_poller_backup.tar grid
Restore the RTM license file
The RTM license file is automatically backed up as part of the database backup procedure, but it is not restored during the database restore procedure. To manually restore a license file, extract the license file from the backup .tgz archive file
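For sites that prefer to script the poller backup step described above, the following Python sketch does roughly what `tar -cf grid_poller_backup.tar grid` does when run from the parent directory. It is an illustration under stated assumptions, not part of RTM: the function name is hypothetical, and the directory and archive paths must be supplied by the caller.

```python
import tarfile
from pathlib import Path
from typing import List


def backup_directory(parent: Path, name: str, archive: Path) -> List[str]:
    """Archive parent/name into an uncompressed tar file (hypothetical helper).

    Members are stored relative to `parent`, similar to running
    `tar -cf <archive> <name>` from inside `parent`. Returns the list of
    member names read back from the archive as a basic sanity check.
    """
    with tarfile.open(archive, "w") as tar:
        # arcname keeps paths relative (for example, grid/...), not absolute.
        tar.add(parent / name, arcname=name)
    with tarfile.open(archive, "r") as tar:
        return tar.getnames()
```

For the poller layout described in this guide, the call would look like backup_directory(Path("/usr/local"), "grid", Path("/usr/local/grid_poller_backup.tar")), run with sufficient permissions on each poller host.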
66. ost Info section is located in the Grid menu bar.
Servers page
Navigate to the Servers page by clicking Servers under the Host Info section of the Grid menu bar. This page shows information identical to the LSF lshosts command.
Host Name: The name of the host.
Cluster: The LSF cluster to which this host belongs.
Type: The type of the host, as defined in the LSF configuration.
Model: The model of the host, as defined in the LSF configuration.
CPU Factor: The CPU factor of the host, as defined in the LSF configuration.
Max CPUs: The number of processors on this host.
Max Memory: The maximum amount of physical memory available for user processes (G: gigabytes, M: megabytes).
Max Swap: The total available swap space (G: gigabytes, M: megabytes).
Max Temp: The maximum available space in /tmp (G: gigabytes, M: megabytes).
Total Disks: The number of local disk drives directly attached to the host.
Resources: The Boolean resources defined for this host, denoted by resource names, and the values of external numeric and string static resources.
Clients page
Navigate to the Clients page by clicking Clients under the Host Info section of the Grid menu bar. This page shows information similar to the LSF command lshosts, except that it only displays LSF clients. In addition to showing current clients that have registered with LSF, it shows all prior clients that have performed operations on the LSF cluster.
67. ost Slots: The total number of slots available to run jobs in the cluster.
Pend Jobs: The total number of pending jobs in the cluster.
Run Jobs: The total number of running jobs in the cluster.
Susp Jobs: The total number of suspended jobs in the cluster, including system-suspended and user-suspended jobs.
Hourly Started: The total number of jobs started during the last hour.
Hourly Done: The total number of jobs completed during the last hour.
Hourly Exit: The total number of jobs aborted during the last hour (unsuccessful completion).
The license page shows the following information:
Server Name: The defined name for the license server. Click a name to open the License Usage page and view information on licenses in the license server.
Vendor: The specified software vendor providing services on this license server.
Location: The specified physical location of this license server.
Poller Name: The name of the poller associated with this server.
Collect Status: The current status of the license server. Status can be Down, Recovering, Up, or Unknown.
Event Count: The number of events logged.
Current Time: The current time taken for polling this server.
Average Time: The average time taken for polling this server.
Max Time: The maximum time taken for polling this server.
Availability: The availability of the license server, based on the percentage of succ
68. ou can configure the following items from this page:
Template propagation enabled: Enable the propagation of changes to the threshold template.
Threshold name: The name of the threshold as it appears in the Name column in the list of thresholds.
Note: You can use placeholders to customize your threshold name. Placeholders for the threshold name are enclosed by pipe characters (|), for example, |cluster_name|.
Threshold enabled
Weekend exemption: Disable threshold alerts on weekends.
Disable restoration email: Disable threshold alerts when the threshold has returned to normal.
Reset acknowledgement: Reset acknowledgements when the threshold has returned to normal.
High/low threshold values
Threshold type: High/low, baseline, or time-based.
Breach duration: The breach duration before the threshold raises an alert.
Event triggering (Shell command): Specifies event trigger commands or shell scripts in the event of a breach.
High threshold: If the threshold is breached because the data source exceeds this value, the threshold triggers the specified command or shell script.
Low threshold: If the threshold is breached because the data source drops below this value, the threshold triggers the specified command or shell script.
Norm threshold: If the threshold is breached and then returns to normal, the threshold triggers the specified command or shell script.
Event triggering (Grid administrator host level triggers): Specifies host level actions
69. ou change any of the colors to None, the corresponding event is not shown in the legend. All of these events are logged to the Cacti log regardless of whether you enabled them in this tab.
You can configure the following on this page:
Track and highlight jobs that change state frequently (job flapping)
Set job flapping threshold
Color to indicate warning state for job flapping
Track and highlight jobs that violate the job efficiency threshold
Set the job start window. Platform RTM only tracks jobs running for at least this period of time.
Set the job warning threshold for the entire cluster
Color to indicate warning state for job efficiency
Set the job alarm threshold for the entire cluster
Color to indicate alarm state for job efficiency
Number of warning/alarm events before issuing a corresponding message
Number of clear events prior to issuing a NOTICE event message
Track PID levels and generate a log message if the threshold is exceeded
Set PID threshold
Highlight pending jobs with dependencies
Colors to indicate different types of jobs
Configuration section
Cacti Settings page
Navigate to the Cacti Settings page by clicking Settings under the Configuration section of the Console menu bar. There is a tab for each category of Cacti settings that you can change.
General tab
Click the General tab to open the Cacti Settings General page and configure the default Cacti settings. You can configure the following on this page:
Event logging
Po
70. our cluster and are only available to users with the Cluster Control Management realm permission (click User Management under the Utilities section of the Console menu bar).
Job Info: By Host, By Host Group, By Queue, By Array, Details pages
These pages provide information about LSF jobs at the level of host, host group, queue, or job group (job array). You can also view detailed information about specific jobs.
User/Group Info: Users page
These pages provide information about LSF users.
Load Info: Host and Host Group pages
These pages provide information about host load and host groups.
Host Info: Servers, Clients, and Groups pages
These pages provide information about LSF cluster servers, clients, and host groups.
Reports: Daily Statistics, License Details, License Checkouts, and Parameters pages
These pages provide information about FlexLM license usage. You can filter statistics and batch system parameters for specific information.
Dashboards section
The Dashboard section is located in the Grid menu bar.
Cluster, License, and Host pages
Together, the RTM dashboards display useful information about the status of your LSF clusters. By changing the icon color, RTM can also alert operators when a host becomes unavailable for some reason. In its current form, you can view the status of each of your clusters, the status of feature licenses, and a pictorial representation of the
71. ...for LSF 6.2 clusters and 7869 for LSF 7.x clusters. The default mbatchd port is 6881.

Known issues
For a list of the latest known issues, refer to the Release Notes for Platform RTM.
72. Archiving tab
Click the Archiving tab to open the Grid Settings Archiving page and configure database archiving settings. Data archiving allows deep-dive analysis that does not impact the system, because you perform the analysis on the archive database instead of on a database that is currently in use.

You can configure the following on this page:
  Enable data archiving
  Frequency of data archiving
  Database type that will store data archives
  Name of host receiving data archives
  Name of the database receiving data archives
  Database account user name, password, and port for connecting with the database
  Enable RRD file creation for archiving during record purging
    Note: Enabling this will result in very large data archives.
  Storage location of archived RRD files

Paths tab
Click the Paths tab to open the Grid Settings Paths page and configure cluster directories and file paths.

You can configure the following on this page:
  Location of log files on poller hosts (for example, /var/www/html/cacti/log/cacti.log). If the file is found and verified, the message OK FILE FOUND appears below this field.
  Location of job rusage RRD and image files (for example, /opt/cacti/gridcache). If the directory is found and verified, the message OK DIR FOUND appears below this field.

Thold tab
Click the Thold tab to open the Grid Settings THold page and configure cluster thres
73. ...password-less authentication (ssh private key authorization or rsh) available with all other hosts in the LSF cluster.

rsh password-less authentication:
  The .rhosts file in the LSF master host specifies the root user of the RTM host.
  The LSF master host and the RTM host both have the incoming TCP port 514 open.
  The LSF master host has password-less authentication (ssh private key authorization or rsh) available with all other hosts in the LSF cluster.

Enable grid control to allow you to control LSF clusters using the RTM user interface.

1. Click the Console tab.
2. Enable grid control for each applicable user in the RTM host.
   a. Under the Utilities section of the Console menu bar, click User Management.
   b. Click the name of the user for which you want to enable grid control.
   c. In the Realm Permissions section, select the Cluster Control Management field if it is currently unchecked.
3. Under the Grid Management section of the Console menu bar, click Clusters.
4. Click the name of the cluster that you want to control. The Cluster Edit page displays.
5. In the User Authentication settings section, specify LSF_TOP and the settings for the LSF administrator account in the LSF master host. To ensure that RTM has access to the appropriate LSF commands, you must consider the following: The specified LSF administrator user name is the name of the LSF administrator account in the LSF cluster for which you are enabling grid control. This account is used by t
74. Aggregation tab
Click the Aggregation tab to open the Grid Settings Aggregation page and configure default behavior for project information aggregation, host information aggregation, and memory tracking.

You can configure the following on this page:
  Wallclock calculation method. Set this field for chargeback calculations, depending on whether you charge for suspend time.
  Enable the tracking of submitted project names. Enabling this allows you to collect job data based on project names.
  Track where jobs are submitted from
  Indicate aggregation method for collected project names. Project aggregation is used for project names that contain hierarchical metadata to assist with tracking.
  Starting string position (digit) during collection
  Delimiter for separating hierarchy levels
  Number of significant delimiter fields
  Track license project job performance
  Track job memory resources. If enabled, Platform RTM maintains an internal table of memory performance statistics.
    Note: Only enable this setting if you know how to access and use the internal table.

Status Events tab
Click the Status Events tab to open the Grid Settings Status Events page and configure default behavior, thresholds, and visual cues for job flapping, cluster and job efficiency, PID levels, and job dependencies. If you change any of the colors to None, the corresponding event is not shown in the legend. All of these events are logged to the Cacti log regardless o
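As a rough illustration of the project aggregation settings on the Aggregation tab (starting string position, hierarchy delimiter, and number of significant delimiter fields), the following Python sketch shows one plausible way a hierarchical project name could be reduced to an aggregate name. The guide does not spell out the exact rules RTM applies, so the function name `aggregate_project_name`, the zero-based start position, and the sample project name are all assumptions made for illustration only.

```python
def aggregate_project_name(name: str, start: int = 0,
                           delimiter: str = "_", fields: int = 2) -> str:
    """Hypothetical helper: keep only the leading hierarchy levels of a project name.

    start     -- assumed zero-based position at which the significant part begins
    delimiter -- character separating hierarchy levels
    fields    -- number of significant delimiter fields to keep
    """
    if fields < 1:
        raise ValueError("fields must be at least 1")
    trimmed = name[start:]              # drop any leading prefix
    levels = trimmed.split(delimiter)   # split into hierarchy levels
    return delimiter.join(levels[:fields])


# Jobs submitted under "chipA_design_block7_run3" would be grouped
# under the aggregate name built from the first two levels.
print(aggregate_project_name("chipA_design_block7_run3"))  # prints chipA_design
```

With settings like these, every project name sharing the same leading levels collapses into one aggregate, which is what makes hierarchical project names useful for tracking.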
75. Restrict the maximum job time in which your users can zoom in, after which they cannot zoom in anymore. You should restrict this setting to reduce load, as the job zoom function is system intensive. You should set a smaller window for larger clusters to reduce system load.
  User group filter operation. Specify how your cluster handles user group filtering:
    User Group Membership: User accounts are assigned to a user group.
    Job Specification: Jobs are assigned to user groups at job submission time by using bsub -G.
  Maximum export rows. Restrict the maximum number of rows that your users can export. You should set fewer rows to increase system performance.
  Enable cluster CPU factor leveling
    Important: Do not enable this setting unless you understand how to apply CPU factoring to hosts in your cluster.

Poller tab
Click the Poller tab to open the Grid Settings Poller page and configure poller defaults for data collection, interval settings, and thresholds.

You can configure the following on this page:
  Enable daemons

Maint tab
Click the Maint tab to open the Grid Settings Maint page and configure system maintenance settings. You can keep more data for smaller clusters because there are fewer records for these clusters.

You can configure the following on this page:
  Time when past database records are removed
  Retention period for job de
76. ...period from the previous week display.

Page navigation
Using the navigation bar, you can move from page to page within a display area.

<< Previous    Showing Rows 1 to 30 of 40 [1, 2]    Next >>

Option descriptions:
  Click << Previous to return to a previous page in the list.
  Click Next >> to move forward a page.
  Click any page number in the center of the bar to immediately go to that page.

Headers and sorting
[Figure: column headers of a host listing: Host Name, Cluster, Type, Model, Load/Batch, CPU Fact, CPU Pct, RunQ 1m, Max Slots, Num Slots, Run Slots, SSUSP Slots, USUSP Slots, Reserve Slots, Actions]

Click a column heading to sort the contents of the display area based on your selection. The default sort order is controlled at the system level and is biased towards the most likely sort order for that information. Clicking twice on a column heading reverses the sort order. Some columns may not appear sortable; this is the normal behavior.

Action icons
Many pages have an Actions column in the header. Under this column are various icons indicating the type of action available from within this page. Some common action icons are described in the following table.

Description
  View queues
  View host job detail
  View users
  View graphs
  View batch hosts
  Zoom into graph
  View batch host groups
  Display jobs in range
77. ...running jobs outside the grid management system
  When to consider a host starved

Aggregation tab
Click the Aggregation tab to open the Grid Settings Aggregation page and configure default behavior for project information aggregation, host information aggregation, and memory tracking.

You can configure the following on this page:
  Wallclock calculation method. Set this field for chargeback calculations, depending on whether you charge for suspend time.
  Enable the tracking of submitted project names. Enabling this allows you to collect job data based on project names.
  Track where jobs are submitted from
  Indicate aggregation method for collected project names. Project aggregation is used for project names that contain hierarchical metadata to assist with tracking.
  Starting string position (digit) during collection
  Delimiter for separating hierarchy levels
  Number of significant delimiter fields
  Track license project job performance
  Track job memory resources. If enabled, Platform RTM maintains an internal table of memory performance statistics.
    Note: Only enable this setting if you know how to access and use the internal table.

Status Events tab
Click the Status Events tab to open the Grid Settings Status Events page and configure default behavior, thresholds, and visual cues for job flapping, cluster and job efficiency, PID levels, and job dependencies. If y
78. ...s aggregate information for the job array as a whole.

The information shown on this page is as follows:
  Array ID: The job array ID.
  Job Name: The name of the job.
  User ID: The identifier of the user who submitted the job array.
  Total Jobs: The total number of jobs in the job array.
  Pending Jobs: The number of jobs that remain pending in the job array.
  Running Jobs: The number of currently running jobs.
  Done Jobs: The number of jobs completed without error.
  Exit Jobs: The number of jobs where errors prevented the job from completing.
  Array Effic: The average CPU efficiency of jobs in the job array.
  Avg Memory: The average memory used by jobs in the array.
  Avg Swap: The average swap space used by jobs in the array.
  Total CPU Time: The total CPU time used by all started jobs in the job array.

Details page
Navigate to the Details page by clicking Details under the Job Info section of the Grid menu bar. Filter batch job information to view only the job types you are interested in. Clear the Dynamic check box if you do not want the page to update immediately each time you change a filter setting, and instead want to complete all filter settings and then click go.

The information shown on this page is as follows:
  Job ID: The job ID that LSF assigned to the job. Click a job number to view an information page containing details about that job, including general job information
79. ...displayed, you may change your presentation from the smaller icons to the larger icons and vice versa.
  Host Status Popup Transition Host Count: When this number of hosts is exceeded, RTM disables JavaScript popups to allow the screen to refresh faster.
  Number of Records to Display: When viewing records in tabular form throughout the RTM interface, you can display varying amounts of data. This setting provides each interface a default number of records to display.

Timespans tab
Click the Timespans tab to open the Grid Settings Timespans page and control how your grid summary icons appear. Each user can display thresholds that they believe represent how their grid hosts are behaving.

This page contains the following fields:
  Default Grid View Timespan: When viewing jobs in the RTM interface, indicates which default timespan is in effect. Note that this setting only applies when viewing the following statuses: All, Done, and Exit.
  Default Graph View Timeshift: The default time shift when viewing job details.
  Allow Graph to extend to Future
  First Day of the Week: Used for the dayshift timespan.
  Start of Daily Shift: Used for the dayshift timespan.
  End of Daily Shift: Used for the dayshift timespan.

Clusters tab
Click the Clusters tab to open the Grid Settings Clusters page and customize your view of the Dashboards > Cluster page. Choose to show or hide various information columns.

Queues tab
Click the Queu
80. ...host as specified in the LSF configuration.
  Model: The model of the host as specified in the LSF configuration.
  Status: The current status of the host.
  RunQ 15 sec: The exponentially averaged effective CPU run queue length over the last 15 seconds.
  RunQ 1 min: The exponentially averaged effective CPU run queue length over the past 1 minute.
  RunQ 15 min: The exponentially averaged effective CPU run queue length over the past 15 minutes.
  CPU: The current CPU utilization rate.
  Page Rate: The memory paging rate, exponentially averaged over the last minute, in pages per second.
  I/O Rate: The disk I/O rate, exponentially averaged over the last minute, in kilobytes per second.
  Cur Logins: The number of current login users.
  Idle Time: On Unix, the idle time of the host, in minutes. On Windows, the time a screen saver has been active on the host.
  Temp Avail: The amount of free space in /tmp (G = gigabyte, M = megabyte).
  Swap Avail: The amount of swap space available (G = gigabyte, M = megabyte).
  Mem Avail: The amount of physical memory available (G = gigabyte, M = megabyte).

Group Load page
Navigate to the Group Load page by clicking Host Group under the Load Info section of the Grid menu bar. This page shows host performance information aggregated to the level of LSF host group.
  Group Name: The name of the host group. Click a group name to go to the Host Load page and view information similar to running the LSF command lsload.
  Cluster: The cluster to
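Several Host Load fields (the RunQ values, Page Rate, and I/O Rate) are described as exponentially averaged. The guide does not give the formula LSF uses, so the following Python sketch only illustrates the general idea of an exponential moving average over a time window. The function name `exponential_average`, the 5-second sampling interval, and the sample values are assumptions for illustration, not LSF internals.

```python
import math


def exponential_average(samples, interval_s, window_s, initial=0.0):
    """Return an exponentially weighted average after feeding all samples.

    Each new sample is blended with the running average; older samples
    fade with a time constant of `window_s` seconds (assumed model).
    """
    decay = math.exp(-interval_s / window_s)  # weight kept by the old average
    average = initial
    for sample in samples:
        average = average * decay + sample * (1.0 - decay)
    return average


# Hypothetical run queue lengths sampled every 5 seconds: idle, then busy.
queue_lengths = [0.0, 0.0, 4.0, 4.0, 4.0, 4.0]

# A short window reacts quickly to the load change; a long window lags.
print(round(exponential_average(queue_lengths, interval_s=5, window_s=15), 3))
print(round(exponential_average(queue_lengths, interval_s=5, window_s=60), 3))
```

This is why a 15-second value and a 15-minute value for the same host can differ widely just after a load spike: the longer window weights older samples more heavily.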
81. CPU Fact: The CPU factor of the host as defined in the LSF configuration.
  CPU Pct: The current CPU utilization on the host.
  RunQ 1m: The exponentially averaged effective CPU run queue length for this host over the last minute.
  Max Slots: The maximum number of job slots that can be allocated to this host.
  Num Slots: The number of job slots used by jobs dispatched to this host.
  Run Slots: The number of job slots used by jobs running on this host.
  SSUSP Slots: The number of job slots used by system-suspended jobs on the host.
  USUSP Slots: The number of job slots used by user-suspended jobs on the host.
  Reserve Slots: The number of job slots used by pending jobs that have job slots reserved within the host.

If graphs have been created for this host, a graph icon appears to the left of the host name. Click the icon to view graphs for the host.

By Host Group page
Navigate to the By Host Group page by clicking By Host Group under the Job Info section of the Grid menu bar. In many respects, this page shows information similar to that obtained using bhosts with condensed host groups. The Status filter is populated with all unique Load and Batch statuses currently experienced by hosts in any cluster.

This page shows job information by LSF host group:
  Host Group: The name of the LSF host group. Click a host group name to show running jobs for this group on the Job Info > Details page.
  Cluster: The LSF cluster to which this host group belongs.
82. ...details: Records of job details are kept for this period of time after the job has ended. The size of each job record depends on job volume and your cluster settings. The system can hold a maximum of 10 million records. Use this upper limit, along with the approximate number of jobs per week in your cluster, to determine the ideal retention period.
  Retention period for individual job records: Individual job records are kept for this period of time after the job has ended. The size of each job record depends on job volume and your cluster settings.
  Retention period for daily summary statistics: Records of daily summary statistics are kept for this period of time after the job has ended. As these records are added every day, you can keep records for a longer period of time, depending on the job volume. Smaller clusters with fewer than one million jobs per year can have a retention period as high as three years.
  Maximum number of database records to remove
  Maximum down time for daemons disabled for maintenance purposes
  Enable database backup: This enables a disaster recovery backup to restore your Cacti and RTM configuration. Some job data is lost during the database restoration, though you can use other utilities to restore all the job data.
    Note: Database backup files are disk intensive for larger clusters.
  Database backup schedule
  Number of database backups to maintain
  Database backup file location
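The guidance on the retention period for job details (a ceiling of 10 million records, weighed against the number of jobs per week) amounts to a simple division. The Python sketch below works through it under the assumption that each job contributes one record; the function name `max_retention_weeks` and the sample job volume are hypothetical, and the result is only an upper bound, not a recommended setting.

```python
MAX_JOB_RECORDS = 10_000_000  # upper limit stated in the guide


def max_retention_weeks(jobs_per_week: int,
                        max_records: int = MAX_JOB_RECORDS) -> int:
    """Return the longest retention period, in whole weeks, that stays
    under the record ceiling, assuming one record per job."""
    if jobs_per_week <= 0:
        raise ValueError("jobs_per_week must be positive")
    return max_records // jobs_per_week


# Hypothetical cluster that runs about 250,000 jobs per week.
print(max_retention_weeks(250_000))  # prints 40
```

A cluster with a lower weekly job volume can keep job detail records proportionally longer before reaching the same ceiling.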
83. ...user. Click a user name to show details of running jobs submitted by this user on the Job Info > Details page.
  Max Slots: The maximum number of job slots that can be processed concurrently for the specified user's jobs.
  Num Slots: The current number of job slots used by the specified user's jobs.
  Started Slots: The number of slots used by jobs submitted by this user and started by LSF. Started jobs can be running, system suspended, or user suspended.
  Pending Slots: The number of job slots used by the user's pending jobs.
  Running Slots: The number of job slots used by the user's running jobs.
  Effic: The average CPU efficiency for jobs submitted by this user.
  Sys Susp Slots: The number of job slots used by the user's system-suspended jobs.
  User Susp Slots: The number of job slots used by the user's user-suspended jobs.
  Reserve Slots: The number of job slots used by the user's pending jobs.

Load Info section
The Load Info section is located in the Grid menu bar.

Host Load page
Navigate to the Host Load page by clicking Host under the Load Info section of the Grid menu bar. This page shows information that is similar to the LSF command lsload.
  Host Name: The name of the host. Click a host name to show details about jobs running on that host on the Job Info > Details page.
  Cluster: The LSF cluster to which this host belongs.
  Type: The type of the ho
84. ...average and maximum run time of jobs in that queue, as well as the average and maximum pending time for the queues.

The information shown on this page is as follows:
  Queue name: The name of the LSF queue. Click a queue name to show running jobs in this queue on the Job Info > Details page.
  Cluster name: The LSF cluster to which this queue belongs.
  Priority: The priority of the queue.
  Status/Reason: The current status of the queue, with further detail about the status.
  Max Slots: The maximum number of job slots that can be used by the jobs in the queue.
  Num Slots: The total number of available slots for this queue.
  Run Slots: The number of job slots used by running jobs in the queue.
  Pend Slots: The number of job slots used by pending jobs in the queue.
  Suspend Slots: The number of job slots used by suspended jobs in the queue.
  AVG Pend: The average number of job slots held by pending jobs in the queue.
  MAX Pend: The maximum number of job slots held by pending jobs in the queue.
  AVG Run: The average number of job slots held by running jobs in the queue.
  MAX Run: The maximum number of job slots held by running jobs in the queue.

If you select any queue, you are directed to a display of all RUNNING jobs within that queue.

By Array page
Navigate to the By Array page by clicking By Array under the Job Info section of the Grid menu bar. This page shows information similar to the command bjobs -A <job id> but also include
