Mellanox Care Quick Start Guide.book
Contents
1. Chapter 2 Mellanox Care Architecture
2.1 Mellanox Care Communication Protocols
2.2 Network Security
Chapter 3 Installing Mellanox Care
3.1 Installation Prerequisites
3.1.1 Mellanox Care Server Resource Requirements per Cluster Size
3.1.2 Required Customer Information
Chapter 4 Getting Familiar with Mellanox Care Web UI
4.1 Mellanox Care UI Navigator Buttons
4.2 Mellanox Care Tabs
4.2.1 The Settings Tab
4.2.1.1 The General Panel Internal Structure
4.2.1.2 The E-mail Panel Internal Structure
4.2.1.3 The Reports Panel Internal Structure
4.2.1.4 The Remote Folder Internal Structure
4.2.2 The Fabrics Tabs
4.2.2.1 The Manage Panel
4.2.2.2 The Health Engine Panel
2. Table 7: Mellanox Daily Report Derived Log Files
Table 8: Daily Report Table Information
Table 9: Mellanox Care Events Configuration List

Mellanox Technologies, Rev 1.0

About This Manual
This document describes the features, performance and configuration of the Mellanox Care application.

Intended Audience
This Mellanox Care Quick User Manual is intended for server and network administrators who would like to set up the Mellanox Care central management service.

Document Conventions
The following conventions are used in this document:
NOTE: Identifies important information that contains helpful suggestions.
CAUTION: Alerts you to the risk of personal injury, system damage, or loss of data.
WARNING: Warns you that failure to take or avoid a specific action might result in personal injury or a malfunction of the hardware or software. Be aware of the hazards involved with electrical circuitry and be familiar with standard practices for preventing accidents before you work on any equipment.

1 Mellanox Care Overview
Mellanox Care service is an advanced management service which provides an around-the-clock monitoring tool, accompanied by expert troubleshooting analysis of the customer's InfiniBand and f
3. Customer Name: Mellanox Care customer name
Mellanox Care Server IP: Mellanox Care server IP address
Mellanox Care SSH Username: Username of the Mellanox Care server
Mellanox Care SSH Password: Password of the Mellanox Care server
Installation: Type of installation (read only)

Figure 2: The General Panel Internal Structure
General: Customer Name John Customer, Mellanox Care Server IP 00.000.00.00, Mellanox Care SSH Username jcust, Mellanox Care SSH Password, Installation
Changes must be saved before exiting any tab; otherwise they will be deleted.

4.2.1.2 The E-mail Panel Internal Structure
The e-mail panel includes the following e-mail settings provided by the customer:
SMTP Server: According to the e-mail account provided by the customer
SMTP Port: Can be 25, 465 or any other port
SMTP Username: According to the e-mail account provided by the customer
SMTP Password: According to the e-mail account provided by the customer
Mail Sender: Name of the e-mail sender (same as the SMTP username unless the customer provided another one)
Use Authentication: Select when the SMTP server requires authentication
Use SSL: Select when the SMTP server requires secured communication

Figure 3: The E-mail Panel Internal Structure
Email: SMTP Server Xxxx company co il, SMTP Port 25, SMTP Username jcust mel
4. Please note that any cases that are still open will be handled by Mellanox, and you will be updated accordingly as to their ongoing status. We hope that you are satisfied with the service you are receiving from Mellanox Care. If you require any additional assistance, or if you would like to provide feedback on your experience, please contact Mellanox. Contact details: Email: support@mellanox.com, Phone: 1-408-916-0055.
Sincerely, Mellanox Care
Mellanox Care: With You Every Step of the Way

7 Third Party Alarms
In addition to events given by the health provider (UFM), Mellanox Care also supports external events that were generated by third-party utilities. The external events can be used to integrate Mellanox Care with third-party tools that discover alarms which are not part of the standard model of the basic health provider.
Third-party alarms work as follows:
1. The third-party tools write their output files into the relevant site external events directory. For example, a third-party tool for site1 should direct its output files to /opt/mlnxcare/external/site1. The file names shall be of the format <timestamp (%Y %m %d %H %M %S)> <3rd party tool name> out, e.g. 2014 10 10 10 10 10 test1 out
2. Mellanox Care reads the external event files during its run and triggers an event on the health provider. One event will be triggered per utility for
5. 4.2.3 The Logs Tab
Chapter 5 Configuring Mellanox Care
5.1 Mellanox Care Devices Configuration
Chapter 6 Mellanox Care Report Types
6.1 Case Report
6.1.1 Case Reports Derived Log Files
6.2 Daily Report
6.2.1 Daily Report Derived Log Files
6.2.2 Daily Report Table Information
6.3 Monthly Report
Chapter 7 Third Party Alarms
Appendix A Mellanox Care Events Configuration List

List Of Tables
Table 1: Mellanox Care Communication Protocols
Table 2: Installation Prerequisites
Table 3: Mellanox Care Server Resource Requirements per Cluster Size
Table 4: Navigator Tabs
Table 5: Mellanox Care Devices Configuration
Table 6: Mellanox Care Case Derived Log Files
6. 09:00:03.467 INFO Message html body was generated
2014-04-10 09:00:03.438 ERROR Site QA failed to connect to site providers due to could not connect 10.209.24.175 U 0002c90300fc7780
2014-04-10 09:00:03.438 CRITICAL UFMDriver 10.209.24.175 failed to connect to UFM over SSH due to Unknown SSH acce
2014-04-10 09:00:03.438 WARNING UFMDriver missing access point attribute hostname for SSH connection to device 0002c
2014-04-10 09:00:03.409 INFO UFMDriver 10.209.24.175 retrieving SSH access
Changes must be saved before exiting any tab; otherwise they will be deleted.

5 Configuring Mellanox Care
5.1 Mellanox Care Devices Configuration
Table 5: Mellanox Care Devices Configuration
Field | Description
GUID | Device GUID
Access Point Type | SSH (Secure Shell)
IP | Device IP
Port | The port being used by the access point type (default: 22)
Username | Device username
Credentials | Device password

6 Mellanox Care Report Types
There are five types of reports that Mellanox Care sends automatically:
• Case reports: when a new critical alarm is found in UFM, the system sends case reports
• Monthly report: a summary of the last Mellanox Care scans during the last month, including the amo
7. [Figure 8, continued: daily report example rows for runs 5 to 14, started between 2014-01-29 14:30:01 and 2014-01-29 20:00:01]

6.3 Monthly Report
The monthly report contains the following information:
The Subject field contains:
• Subject Name (Monthly Report From MCare <customer> <month>)
• The Customer's name
• Timestamp
The message field contains:
• A table summary of cases listed according to site, switches and servers
• A table summary of the list of cases opened in the last month, including date and number of cases

Figure 9: Mellanox Care Monthly Report Example
Dear Customer,
This is an automatic e-mail from the Mellanox Care Service. As part of your Mellanox Care service, we are sending you this monthly report of the activities performed by Mellanox during the past month.
The following cases opened during the last month:
Site Switches Servers site1 5 Date Number of Cases 2013 12 06 01 03 10 Total
8. 3.1.1 Mellanox Care Server Resource Requirements per Cluster Size
The following resource prerequisites are relevant only to customers that already have UFM installed in their cluster.
Table 3: Mellanox Care Server Resource Requirements per Cluster Size
Fabric Size (nodes) | CPU | Memory | Disk Space Minimum | Disk Space Recommended
Up to 1000 | 4-core server | 4GB | 20GB | 80GB
1000-5000 | 8-core server | 16GB | 40GB | 120GB
5000-10000 | 16-core server | 32GB | 80GB | 160GB
Above 10000 | Consult with Mellanox Support

3.1.2 Required Customer Information
The following information must be provided prior to the installation of Mellanox Care:
• An e-mail account to be used for sending e-mails to Mellanox Care NOC. In case there is more than one site for the same account, a separate e-mail should be created for each site.
• A csv file containing the device credentials. This file will be used during the Mellanox Care deployment in order to access the relevant node and collect all logs.
• The information listed in the questionnaire, which will be sent by the project delivery manager before deploying the application at the customer's site.
Note: MellanoxCare@mellanox.com should not be added to the recipients lists in the Reports page of the web User Interface.

4 Getting Familiar with Mellanox Care Web UI
Mellanox Care is configured throug
9. %d
114 | Port Receive Remote Physical Errors | PM_PORTRCVREMOTEPHYSICALERRORS | Critical | 75 | 300 | PortRcvRemotePhysicalErrors counter rate threshold exceeded. Threshold is %d, received value is %d
117 | Port Xmit Constraint Errors | PM_PORTXMITCONSTRAINTERRORS | Critical | 75 | 300 | PortXmitConstraintErrors counter rate threshold exceeded. Threshold is %d, received value is %d
118 | Port Receive Constraint Errors | PM_PORTRCVCONSTRAINTERRORS | Critical | 75 | 300 | PortRcvConstraintErrors counter rate threshold exceeded. Threshold is %d, received value is %d
119 | Local Link Integrity Errors | PM_LOCALLINKINTEGRITYERRORS | Critical | 5 | 300 | LocalLinkIntegrityErrors counter rate threshold exceeded. Threshold is %d, received value is %d
120 | Excessive Buffer Overrun Errors | PM_EXCESSIVEBUFFEROVERRUNERRORS | Critical | 75 | 300 | ExcessiveBufferOverrunErrors counter rate threshold exceeded. Threshold is %d, received value is %d
122 | Congested Bandwidth (%) Threshold Reached | PM_XMITWAITERROR | Critical | 10 | 300 | Congested Bandwidth in percents threshold exceeded. Threshold is %d, received value is %d
130 | Non-optimal link width | PHY_NON_OPTIMAL_WIDTH | Critical | 1 | 7200 | Found %s link that operates in %s width mode

Table 9: Mellanox Care Events Configuration List
Event ID | Event Name | Event code name | Severity | Threshold | TTL | Event Description
13
10. example, the files 2014 10 10 10 10 10 test1 out and 2014 10 10 10 10 11 test1 out, which were both generated by a third-party tool called test1, will trigger one event on the health provider of site1. This event will also be saved to the Mellanox Care database, so it will be ignored on the next run of Mellanox Care.
3. Mellanox Care attaches the external event files to the case sent to Mellanox Support, together with the rest of the files collected by Mellanox Care from its health providers.

Figure 10: Third Party Alarms Workflow

Appendix A: Mellanox Care Events Configuration List
Table 9: Mellanox Care Events Configuration List
Event ID | Event Name | Event code name | Severity | Threshold | TTL | Event Description
130 (wrapped cells from the last row on this page) | width | L_WIDTH
110 | Symbol Error | PM_SYMBOLERROR | Critical | 200 | 300 | Symbol Error counter rate threshold exceeded. Threshold is %d, received value is %d
111 | Link Error Recovery | PM_LINKERRORRECOVERY | Critical | 1 | 300 | Link Error Recovery counter rate threshold exceeded. Threshold is %d, received value is %d
112 | Link Downed | PM_LINKDOWNEDCOUNTER | Critical | 4 | 600 | Link Downed counter rate threshold exceeded. Threshold is %d, received value is %d
113 | Port Receive Errors | PM_PORTRCVERRORS | Critical | 75 | 300 | PortRcvErrors counter rate threshold exceeded. Threshold is %d, received value is
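Returning to the third-party alarm workflow of Chapter 7: one event is triggered per utility, however many output files that utility wrote. The following Python fragment is a minimal sketch of that grouping step, not Mellanox Care's own code. The function name `group_by_tool` is hypothetical, and the file-name layout `<timestamp>_<tool>.out` with a `%Y-%m-%d_%H-%M-%S` timestamp is an assumption, because the separator characters of the real format are not legible in this copy of the manual.

```python
from collections import defaultdict
from datetime import datetime

# Assumed layout: <timestamp>_<tool name>.out (separators are an assumption).
TIMESTAMP_FORMAT = "%Y-%m-%d_%H-%M-%S"
TIMESTAMP_LENGTH = len("2014-10-10_10-10-10")


def group_by_tool(file_names):
    """Group external event file names by the tool that produced them.

    Returns a dict mapping tool name -> sorted list of file timestamps.
    Names that do not match the assumed layout are skipped.
    """
    groups = defaultdict(list)
    for name in file_names:
        if not name.endswith(".out"):
            continue
        stem = name[: -len(".out")]
        stamp = stem[:TIMESTAMP_LENGTH]
        separator = stem[TIMESTAMP_LENGTH:TIMESTAMP_LENGTH + 1]
        tool = stem[TIMESTAMP_LENGTH + 1:]
        if separator != "_" or not tool:
            continue
        try:
            when = datetime.strptime(stamp, TIMESTAMP_FORMAT)
        except ValueError:
            continue
        groups[tool].append(when)
    return {tool: sorted(times) for tool, times in groups.items()}


files = ["2014-10-10_10-10-10_test1.out", "2014-10-10_10-10-11_test1.out"]
events = group_by_tool(files)
print(len(events))  # one dictionary entry per tool, so a single event for test1
```

Keying the result by tool name mirrors the rule that two files from `test1` yield one event; remembering which files were already reported (the database step in the manual) is left out of this sketch.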
11. snapshot, Cfg2html, Fabric Health report, UFM Health report, Sm log, Event log, Ufm log, ibdiagnet, ufmhealth log, vsysinfo and policy csv

Table 6: Mellanox Care Case Derived Log Files (continued)
Source | Description
Server | System snapshot and Cfg2html
Mellanox Switch | Debug generate dump
Voltaire Switch | ExportLogs
Mellanox Care Server | Mellanox care log and summary log

Figure 7: Mellanox Care Case Example Report
Subject: <CNT_PASSCODE> Auto Case <MCare Case ID> Opened by MCare <CUSTOMER NAME> <TIMESTAMP>
Message: Case files on FTP: ftp 193.47.165.178 case 1
Switch 0008 10500204583: System Module Error, 2014-01-30 11:54:07.062, Module sPSU1u 2 on v sup sw02 4036E (10.240.10.43) status is dcFault
Module 0002 903007 5720 1007 01: Informative Notification, 2014-01-30 11:54:06.593, Module Temperature threshold was exceeded. Threshold is 10, received value is 29
Module 0008 10500204583 1007 01: Informative Notification, 2014-01-30 11:54:07.065, Module Temperature threshold was exceeded. Threshold is 10, received value is 37
Module 0008 10500750378 0 18: Informative Notification, 2014-01-30 11:54:16.373, Module Temperature threshold was exceeded. Threshold is 10, received value is 27

6.2 Daily Report
To ensure that Mellanox Care service is running, a continuous dai
12. 4 | T4 Port Congested Bandwidth | PM_TA4XMITWAITERROR | Critical | 10 | 300 | Congested Bandwidth in percents threshold exceeded. Threshold is %s, received value is %s
135 | T4 Port Normalized Transmit Wait | PM_TANORMALIZEXW | Critical | 10 | 350 | T4 Normalized Transmit Wait counter threshold exceeded. Threshold is %s, received value is %s
252 | License expired | UFM_LICENSE_EXPIRED | Critical | 1 | 7200 | %s License has expired. Please restart UFM server
254 | License Limit Exceeded | UFM_LICENSE_LIMIT | Critical | 1 | 7200 | Managed fabric size %s. Please refer to your system vendor representative to update your license
259 | Bad P_Key Switch External Port | SM_BAD_PKEY_EXT | Critical | 1 | 300 | Key switch external port 1 lid lid d portn d 08 port2 lid lid2 d portn2 d
271 | ISBL LAG Port Up | ISBL_LAG_PORT_UP | Critical | 1 | 7200 | ISBL %s port up
272 | ISBL LAG Port Down | ISBL_LAG_PORT_DOWN | Critical | 1 | 7200 | ISBL %s port down
273 | LAG Port Up | LAG_PORT_UP | Critical | 1 | 7200 | LAG port up
274 | LAG Port Down | LAG_PORT_DOWN | Critical | 1 | 7200 | LAG port down
275 | Port Up | PORT_UP | Critical | 1 | 7200 | Port %s up
276 | Port Down | PORT_DOWN | Critical | 1 | 7200 | Port %s down
277 | Port of LAG Up | PORT_OF_LAG_UP | Critical | 1 | 7200 | Port %s of LAG up
278 | Port of LAG Down | PORT_OF_LAG_DOWN | Critical | 1 | 7200 | Port %s of LAG down
279 | Port of ISBL Up | PORT_OF_ISBL_UP | Critical | 1 | 7200 | Port %s of ISBL up
280 | Port of ISBL | PORT_OF
13. (280, continued) Port of ISBL Down | PORT_OF_ISBL_DOWN | Critical | 1 | 7200 | Port %s of ISBL down
301 | Logical Server State Changed | STATE_CHANGED | Critical | 1 | 7200 | Logical Server changed state from %s to %s

Table 9: Mellanox Care Events Configuration List
Event ID | Event Name | Event code name | Severity | Threshold | TTL | Event Description
328 | Link is Up | LINK_UP | Critical | 1 | 7200 | Link is up: %s
329 | Link is Down | LINK_DOWN | Critical | 1 | 7200 | Link went down: %s
372 | Number of Gateways is Changed | GW_VOL10G_NUM_ROUTERS_CHANGE | Critical | 1 | 7200 | Change in the number of 10GbE Gateways has been detected in interface %s, new number is %s
381 | Switch Upgrade Error | SW_UPGRADE_FAILED | Critical | 1 | 7200 | Software upgrade on switch %s %s failed
392 | Module Temperature Threshold Reached | MODULE_TEMPERATURE_EXCESS | Critical | 10 | 300 | Module Temperature threshold was exceeded. Threshold is %d, received value is %d
394 | Module status FAULT | MODULE_STATUS_FAULT | Critical | 1 | 86000 | Module %s %s on %s %s status is %s
512 | SM Failover | SM_FAILOVER | Critical | 1 | 300 | SM Failover. New SM is running on %s, GUID %s
514 | SM LID Change | SM_LID_CHANGE | Critical | 1 | 300 | SM lid of port guid %016x is changed
517 | Fabric Health Report Error | FABRIC_HEALTH_REPORT_ERROR | Critical | 1 | 1800 | FabricHealth Report completed with %s Errors and %s Warnings
518 | UFM related pro | UFM_PROCESS_DO | Crit
14. Mellanox Technologies
Connect. Accelerate. Outperform.
Mellanox Care User Manual
Rev 1.0
www.mellanox.com

THIS HARDWARE, SOFTWARE OR TEST SUITE PRODUCT ("PRODUCT(S)") AND ITS RELATED DOCUMENTATION ARE PROVIDED BY MELLANOX TECHNOLOGIES "AS-IS" WITH ALL FAULTS OF ANY KIND AND SOLELY FOR THE PURPOSE OF AIDING THE CUSTOMER IN TESTING APPLICATIONS THAT USE THE PRODUCTS IN DESIGNATED SOLUTIONS. THE CUSTOMER'S MANUFACTURING TEST ENVIRONMENT HAS NOT MET THE STANDARDS SET BY MELLANOX TECHNOLOGIES TO FULLY QUALIFY THE PRODUCT(S) AND/OR THE SYSTEM USING IT. THEREFORE, MELLANOX TECHNOLOGIES CANNOT AND DOES NOT GUARANTEE OR WARRANT THAT THE PRODUCTS WILL OPERATE WITH THE HIGHEST QUALITY. ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT ARE DISCLAIMED. IN NO EVENT SHALL MELLANOX BE LIABLE TO CUSTOMER OR ANY THIRD PARTIES FOR ANY DIRECT, INDIRECT, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES OF ANY KIND (INCLUDING, BUT NOT LIMITED TO, PAYMENT FOR PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY FROM THE USE OF THE PRODUCT(S) AND RELATED DOCUMENTATION EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
Me
15. [Figure 6 values: fabric Support, UFM Server IP 00.000.00.00]
Changes must be saved before exiting any tab; otherwise they will be deleted.

4.2.3 The Logs Tab
The Logs panel displays the last 1000 lines of the /opt/mlnxcare/log/mlnxcare.log file.
2014-04-10 09:00:40.453 INFO Cleaning
2014-04-10 09:00:40.432 INFO Starting SysdumpCollector to collect logs from notified devices
2014-04-10 09:00:40.432 INFO Site Support health provider 10.240.10.25 finished no notifications were found
2014-04-10 09:00:40.383 INFO UFMDriver 10.240.10.25 receiving alarms
2014-04-10 09:00:40.353 INFO UFMDriver 10.240.10.25 receiving delta traps
2014-04-10 09:00:40.275 INFO Site Provider Support starting
2014-04-10 09:00:40.274 INFO Site Support retrieving data from site provider
2014-04-10 09:00:40.048 INFO SSHTransport 10.240.10.25 establishing a connection
2014-04-10 09:00:40.014 INFO UFMDriver 10.240.10.25 retrieving SSH access
2014-04-10 09:00:39.855 INFO UFMConnector 10.240.10.25 connection established
2014-04-10 09:00:37.550 INFO UFMConnector 10.240.10.25 attempting to connect
2014-04-10 09:00:37.542 INFO Creating ufm provider for site Support
2014-04-10 09:00:37.542 INFO Site Support connecting to site providers
2014-04-10 09:00:37.542 INFO New case message was sent Case ID 20
2014-04-10
16. abric. Mellanox Care monitors all switches, gateways and servers for critical events, including errors on the physical fabric level, configuration changes, performance monitoring errors, communication errors and other device-related events that could affect the status of temperature, power and hardware modules. Deploying this application enables Mellanox to provide a more efficient and personalized support experience.
Mellanox Care service is based on a proactive Care platform that automatically samples Events/Alarms data from the Mellanox Health Engine and checks whether any critical alarms are reported. Once a case is open, Mellanox NOC experts are committed to analyzing the information and deciding on a course of action in order to solve all issues immediately and keep the InfiniBand fabric up and at peak performance at all times.

1.1 Mellanox Care Benefits
• Optimizes Fabric: Maximizes the performance and uptime of the fabric while minimizing costs from unexpected malfunctions, thus improving the ROI.
• Saves Precious Time: Monitors your fabric 24/7 and quickly alerts you of any potential problem, thus allowing your staff to focus on the mission-critical aspects of the cluster.
• Maximizes Uptime: Quick expert identification and resolution of fabric issues, on top of preventive monitoring, avoids long downtime and troubleshooting periods.
• Improves Fabric Reliability: Enhances your fabric serviceability and reliability while offering the best user
17. Ref ID: ID code of the daily report. To update the Ref ID field, please refer to "Updating Ref ID Field" on page 29.
Note: mlnxcare@mellanox.com must not be added to any of the recipients fields in the Reports Panel.

Figure 4: The Reports Panel Internal Structure
Reports: Case Recipients jjcust company com, Daily Report Recipients jjcust company com, Monthly Report Recipients jcust company com, Scan Interval 00h30m, Daily Report Time 00:00, Daily Report Clear Alarms, Monthly Report, Fabric Health Report On Case, Collect Log Files From Switches, Collect Log Files From Servers, Send to Mellanox Support, CNT Passcode, Ref ID ref id
Changes must be saved before exiting any tab; otherwise they will be deleted.

4.2.1.4 The Remote Folder Internal Structure
The Remote Folder panel includes the following configurations for the FTP folder:
Protocol: FTP or SFTP
Server: The IP or hostname of the FTP server (for Mellanox FTP, use 139.47.165.178)
Path: The path for the file location, has to begin with
Username: The FTP server username

Figure 5: The Remote Folder Internal Structure
Remote Folder: Protocol FTP/SFTP, Server 000.00.000.000, Path, Username MlnxCareDebug, Password
Changes must be saved before exiting any tab; otherwise they will be deleted.

4.2.2 The Fa
18. brics Tabs
The Fabrics tab includes the following panels:
• Manage
• Health Engine

4.2.2.1 The Manage Panel
The Manage panel enables you to add or remove fabrics in the Mellanox Care application. The Manage panel table lists all the customer fabrics monitored by Mellanox Care.
To add a Fabric:
Step 1. Click Add.
Step 2. Add a name for the fabric and a description (open text). Each fabric name must be unique, i.e. it cannot be used twice.
To remove a Fabric:
Step 1. Tick the box of the relevant Fabric.
Step 2. Click Remove.
To stop monitoring a Fabric (recommended before fabric maintenance):
Step 1. Uncheck the active box of the relevant Fabric.
Step 2. Click Save.
[Manage panel example: columns Active, Name, Description; rows QA (Big Setup) and Support (Sup Setup)]
Changes must be saved before exiting any tab; otherwise they will be deleted.

4.2.2.2 The Health Engine Panel
The Health Engine panel enables you to add or change the server IP, Username and Password of the Health Engine relevant to the added fabric. The drop list includes all the customer fabrics monitored by Mellanox Care (see Figure 6).
The Health Engine panel includes the following information:
Server IP: UFM server IP address, or virtual IP for UFM HA
User Name: UFM username
Password: UFM password

Figure 6: The Health Engine Panel
19. (533, continued) Remote UFM | EXTR_UFM_SM_PROBLEM | Critical | 1 | 7200 | %s SM problem
537 | UFM Health Watchdog Critical | UFM_HEALTH_WATCHDOG_CRITICAL | Critical | 1 | 300 | Message
538 | Time Diff Between HA Servers | HA_TIME_DIFF | Critical | 100 | 300 | Time difference between master and standby machines is above the threshold of %d seconds. Master time is %s, standby time is %s
539 | DRBD Connection Performance | DRBD_BAD_CONNECTION_PERFORMANCE | Critical | 300 | 900 | Message
602 | UFM Server Failover | UFM_FAIL_OVER | Critical | 1 | 7200 | Server %s failed, server %s took ownership
603 | Events Suppression | EVENTS_SUPPRESSION | Critical | 300 | 300 | %s events are suppressed

Table 9: Mellanox Care Events Configuration List
Event ID | Event Name | Event code name | Severity | Threshold | TTL | Event Description
605 | Report Failed | REPORT_FAILED | Critical | 100 | 300 | %s Report failed: %s
701 | Non-optimal Link Speed | PHY_NON_OPTIMAL_SPEED | Critical | 1 | 7200 | Found %s link that operates in %s speed mode
702 | Unhealthy IB Port | UNHEALTHY_IB_PORT | Critical | 1 | 7200 | Peer Port %s is considered by SM as unhealthy due to %s
903 | Fabric Configuration Failed | FABRIC_CONFIG_FAILED | Critical | 50 | 7200 | Fabric Configuration failed. Please see log for more details
904 | Device Configuration Failure | DEVICE_CONFIG_ACTION_FAILED | Critical | 50 | 7200 | Configuration action on device %s %s failed. Plea
20. experience
• Non-intrusive: Performs low-footprint monitoring. Only operational data is sent. No actual traffic or sensitive information is collected in the process.

2 Mellanox Care Architecture
Mellanox Care communicates with network devices and the fabric manager through the management network interface, with zero impact on InfiniBand production network traffic.

Figure 1: Mellanox Care Architecture
[Figure labels: Mellanox Care Web Portal, Alert, Customer Data Center, Diagnose and Manage, Mellanox NOC Experts]

2.1 Mellanox Care Communication Protocols
The following are the communication protocols used by Mellanox Care.
Table 1: Mellanox Care Communication Protocols
Protocol | Port | Description
SSH | 22 | Mellanox Care communicates with the managed devices via SSH in order to upload scripts to the switches and servers and download the required logs from servers and switches
FTP | 21 | Mellanox Care uses FTP to send log files to Mellanox Support
HTTP/HTTPS | 80/443 | Mellanox Care uses HTTP for the web UI. HTTP is also used for contacting the Health Engine SDK
SMTP | 25 or 465 | Mellanox Care sends e-mail notifications to Mellanox NOC via the SMTP outgoing port

2.2 Network Security
Mellanox Care does not collect any data, passwords or information about fabric usage stored on the system. The only files that are transmitted to Mellanox are the aforement
21. h the Web User Interface (UI), which is based on the customer's environment.
To launch the Web User Interface, perform the following steps:
Step 1. Launch an internet browser.
Step 2. In the URL field, type http://<MellanoxCare IP ADD>/mlnxcare

4.1 Mellanox Care UI Navigator Buttons
The following table describes the main Mellanox Care panels and categories.
Table 4: Navigator Tabs
Tab Icon | Description
Settings | Click to view and change general, e-mail, reports and remote folder configurations
Fabrics | Click to view, update and manage fabrics and the Health Engine
Run Simulation | Click to run a simulation cycle to check whether application configurations were loaded correctly
Refresh | Click to refresh the content of the User Interface
Logs | Click to view the last 1000 lines of the /opt/mlnxcare/log/mlnxcare.log file
License | Shows type of license and expiration date (example shown: License Evaluation, expires in 2019-10-24)
Version | Shows version number (example shown: Version 1.0.0)

4.2 Mellanox Care Tabs
4.2.1 Settings Tab
The Settings tab includes four panels:
• General
• E-mail
• Reports
• Remote Folder

4.2.1.1 The General Panel Internal Structure
The General panel enables you to view or update Mellanox Care Servers. In the General window, you can change the following information:
22. (518, continued) UFM related process is down | UFM_PROCESS_DOWN | Critical | 1 | 300 | Process %s is down
521 | UFM is being stopped | STOPPING_UFM | Critical | 1 | 300 | Stopping UFM server now
522 | UFM is being restarted | RESTARTING_UFM | Critical | 1 | 300 | Restarting server now
523 | UFM failover is being attempted | ATTEMPTING_UFM_FAILOVER | Critical | 1 | 300 | Attempting UFM failover
524 | UFM cannot connect to DB | CANNOT_CONNECT_TO_DB | Critical | 1 | 1800 | Connection to the database failed

Table 9: Mellanox Care Events Configuration List
Event ID | Event Name | Event code name | Severity | Threshold | TTL | Event Description
525 | Disk utilization threshold reached | DISK_THRESHOLD_REACHED | Critical | 1 | 43000 | Disk space usage in %s is above the threshold of
526 | Memory utilization threshold reached | MEMORY_THRESHOLD_REACHED | Critical | 100 | 300 | Memory usage is above the threshold of %d
527 | CPU utilization threshold reached | CPU_THRESHOLD_REACHED | Critical | 300 | 300 | CPU usage is above the threshold of %d
528 | Fabric interface is down | FABRIC_IFACE_DOWN | Critical | 1 | 43000 | Fabric interface %s is down
529 | UFM standby server problem | UFM_STANDBY_PROBLEM | Critical | 1 | 43000 | Problem with standby server %s
530 | SM is down | SM_IS_DOWN | Critical | 1 | 300 | SM is down: %s
531 | DRBD Bad Condition | DRBD_BAD_COND | Critical | 1 | 43000 | Drbd bad condition detected, failover or takeover will fail
533 | R
23. ioned diagnostics logs. Log files are compressed into a password-protected archive which is sent over FTP/SFTP to the Mellanox Support encrypted library. Logs located behind a firewall are managed and monitored by the Mellanox network security team. SSH access configuration settings of Mellanox Care local fabric components are encrypted into a local secured application database.

3 Installing Mellanox Care
The Mellanox Care application is deployed by a Mellanox expert as part of the Mellanox Care service package. Mellanox Care can be installed on:
• A single standalone dedicated server (central management node)
• The same UFM server, in case a server already exists on the customer's fabric
• A Virtual Machine (VM) in Open Virtualization Format (OVF): Mellanox provides an open-format VM (.ova file) which can be imported by any hypervisor
Note: Each of the above options requires access to the UFM server through the management network interface.

3.1 Installation Prerequisites
The following table describes Mellanox Care system requirements.
Table 2: Installation Prerequisites
Operating System / Package type | Description
Operating Systems | RedHat 6.2 and above, SLES 11 SP2
Operating System Packages | cronie 1.4 and above, httpd 2.2 and above, python
24. l: HTML version of all run_summaries of the specific day

6.2.2 Daily Report Table Information
Table 8: Daily Report Table Information
Subject | Message
Run ID | Link to FTP
Start Time | The timestamp of the traps and alarms collection
Duration | Collection time length (1 represents 1 second). If the duration increases, it could indicate a potential problem, fabric trend, etc.
Critical alarms | The number of existing critical alarms
Alarms | The total number of alarms (critical, minor, warning, info)
Percentage | Critical alarms divided by total alarms
Critical Traps | The number of existing critical traps
Traps | The total number of traps (critical, minor, warning, info)
Percentage | Critical traps divided by total traps
Case Opened | The number of opened cases per time period

Figure 8: Daily Report Example
Subject: <CNT_PASSCODE> Keep Alive from <CUSTOMER NAME> <TIMESTAMP>
Message: Summary files on FTP: ftp 193.47.165.178 summary 2014 01 30 16 57 51
[Example table rows for runs 1 to 4, started between 2014-01-29 12:57:54 and 2014-01-29 14:00:02]
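The two Percentage columns of Table 8 are defined as critical items divided by total items. As a worked illustration, the Python sketch below computes that figure for one run. The function name `critical_percentage` is hypothetical, and two details are assumptions rather than statements from the manual: the ratio is scaled by 100, and a run with no alarms at all is reported as 0.0.

```python
def critical_percentage(critical, total):
    """Return critical/total as a percentage.

    Assumptions: the ratio is scaled by 100, and an empty run
    (total == 0) is reported as 0.0 instead of raising an error.
    """
    if critical < 0 or total < 0 or critical > total:
        raise ValueError("expected 0 <= critical <= total")
    if total == 0:
        return 0.0
    return 100.0 * critical / total


# Hypothetical run: 3 critical alarms out of 12 alarms in total.
print(critical_percentage(3, 12))  # 25.0
```

The same function applies to traps: pass the Critical Traps count and the Traps count of the run.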
25. lanox com, SMTP Password, Mail Sender jcust mellanox com, Use Authentication, Use SSL
Changes must be saved before exiting any tab; otherwise they will be deleted.

4.2.1.3 The Reports Panel Internal Structure
In the Reports window, you can change the following information:
Case Recipients: The contact's e-mail receiving case notifications
Daily Report Recipients: The contact's e-mail receiving daily reports
Monthly Report Recipients: The contact's e-mail receiving monthly reports
Scan Interval: The monitoring scan interval of Mellanox Care
Daily Report Time: Sets the time at which the daily report is received
Daily Report Clear Alarm: Clears all UFM alarms after each daily report
Monthly Report: If selected, a Monthly Report will be sent automatically
Fabric Health Report On Case: Generates a fabric health report via UFM every time a new case is detected
Collect Log Files From Switches: Enables you to collect log files and system snapshots from alarmed switches
Collect Log Files From Servers: Enables you to collect log files and system snapshots from alarmed servers
Send to Mellanox Support: If selected, reports will be sent to the address mlnx
CNT Passcode: A passcode for each customer. It must be unique for each customer in order to prevent case duplication in the Salesforce system.
26. llanox Technologies

Mellanox Technologies
350 Oakmead Parkway Suite 100
Sunnyvale, CA 94085
U.S.A.
www.mellanox.com
Tel: (408) 970-3400
Fax: (408) 970-3403

Mellanox Technologies, Ltd.
Beit Mellanox
PO Box 586 Yokneam 20692
Israel
www.mellanox.com
Tel: +972 (0)74 723 7200
Fax: +972 (0)4 959 3245

Copyright 2014. Mellanox Technologies. All Rights Reserved.
Mellanox, Mellanox logo, BridgeX, ConnectX, Connect-IB, CORE-Direct, InfiniBridge, InfiniHost, InfiniScale, MetroX, MLNX-OS, PhyX, ScalableHPC, SwitchX, UFM, Virtual Protocol Interconnect and Voltaire are registered trademarks of Mellanox Technologies, Ltd.
ExtendX, FabricIT, Mellanox Open Ethernet, Mellanox Virtual Modular Switch, MetroDX, TestX and Unbreakable-Link are trademarks of Mellanox Technologies, Ltd.
All other trademarks are property of their respective owners.

Table of Contents
Table of Contents
List Of Tables
About This Manual
Intended Audience
Document Conventions
Chapter 1 Mellanox Care Overview
1.1 Mellanox Care
27. ly process pings the service periodically, based on a predefined frequency. This configurable, time-based daily report is sent to a predefined mailing list along with the activity runs summary. Mellanox NOC experts monitor daily activity constantly. If a daily report is not received for a predefined period, the Mellanox expert contacts the customer to verify the Mellanox Care process status and, together with the customer, decides on a course of action to bring the service up again. Support experts make their best effort to restore the Mellanox Care service as quickly as possible.
This service also provides enhanced statistical information, which can indicate a potential problem, fabric trend or fabric malfunction that requires further diagnosis or immediate handling to avoid fabric downtime.
The daily report contains the following information:
The Subject field contains:
• Subject Name (i.e., daily report)
• The Customer's name
• Timestamp
The message field contains:
• A link to FTP
• A table that lists the alarms and traps received during the past day
6.2.1 Daily Report Derived Log Files
The Mellanox Care daily report derives the following log files automatically whenever an alert occurs.
Table 7 Mellanox Daily Report Derived Log Files
Source | Description
Health Engine | Fabric Health report and UFM health report
Mellanox Care Server | Mellanox_care.log, Run_summary.log, Run_summary htm
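The missed-report check described at the start of this section (a daily report that does not arrive within a predefined period) amounts to a simple elapsed-time comparison. The sketch below is an illustration in Python, not Mellanox Care's actual implementation; the function name, the 24-hour threshold and the timestamps are all assumptions chosen for the example.

```python
from datetime import datetime, timedelta


def report_overdue(last_report, now, max_silence):
    """Return True when no daily report arrived within `max_silence`.

    `last_report` and `now` are datetimes; `max_silence` is a timedelta
    standing in for the "predefined period" (its real value is not given
    in the manual).
    """
    return now - last_report > max_silence


# Illustrative values only.
last = datetime(2014, 1, 1, 6, 0)
threshold = timedelta(hours=24)

print(report_overdue(last, datetime(2014, 1, 2, 5, 0), threshold))   # False (23 hours elapsed)
print(report_overdue(last, datetime(2014, 1, 2, 12, 0), threshold))  # True (30 hours elapsed)
```

When the check returns True, the follow-up is the manual process described above: the expert contacts the customer to verify the Mellanox Care process status.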
28. se ration Failure CTION FAILED see log for more details
905 | Device Configuration Timeout | DEVICE CONFIG ACTION TIMEOUT | 50 | 7200 | Configuration action on device(s) %s got timeout. Please see log for more details.
906 | Provisioning Validation Failure | PROVISIONING VALIDATION FAILURE | Critical | 50 | 7200 | Provisioning validation of fabric failed. Please see log for more details.
29. unt of the cases sent during each day
• Daily report: a summary of all the Mellanox Care scans during the last 24 hours
• Exception report: sent whenever there is an issue with Mellanox Care
• Manual run: same content as the daily report. The subject heading is named manual report.
Each of the above reports has its own configuration, which is set in the Reports tab. You can also update the recipients of Case, Daily and Monthly Reports. In addition, you can update the following configurations.
6.1 Case Report
A Mellanox Care case report is sent when a critical alarm is detected in the customer's fabric. It contains the following information:
The subject field contains:
• Case Number
• Customer Name
• Timestamp
The message field contains:
• A link to FTP where the logs (see the table below) are stored. These logs record all fabric activities and allow Mellanox support to quickly identify the problem and find a resolution.
• A link to the customer UFM
• The critical alarm description, which provides information about the specific faulty switch or server as well as the alarm timestamp
• An inventory list of the fabric
• mlnxcare version
• Case details
6.1.1 Case Reports Derived Log Files
When a Mellanox Care case is opened, it derives the following log files automatically.
Table 6 Mellanox Care Case Derived Log Files
Source | Description
Health Engine | System