Home

PGI Profiler User Guide

image

Contents

1. gt Reports time attributable to the code in a routine and all the routines it called This method requires that you recompile and relink your program using one of these compiler options gt Use Mprof func for routine level profiling Routine level profiling can be useful in identifying which portions of code to analyze with line level profiling gt Use Mprof lines for source line level profiling The overhead of using line level profiling can be high so it is more suited for fine grained analysis of small pieces of code rather than for analysis of large long running applications PGI Profiler User Guide 2 Getting Started 1 2 2 Sample based Profiling Sample based profiling uses statistical methods to determine the execution time and resource utilization of the routines source lines and assembly instructions of the program Sample based profiling is less intrusive than instrumentation based profiling so profiling runs take much less time Further in some cases it is not necessary to rebuild the program The basic pgco11ect method described earlier in Basic Profiling is a time based sampling method pgcollect also supports event based profiling on linux86 64 The following sections describe both time based and event based sampling For information on the differences in how instrumentation and sample based profiling measure time refer to Measuring Time Time based Sampling With time based sampling the program s
2. TABLE OF CONTENTS LANE E vii INTENSA EI Y vii Supplementary Documentation vii Compatibility and Conformance to Standards seen vii Organizatiolr errantes NANAIG tacere rede Dn red eda aad viii EN Uer ix TAM C AN A AL PAA a PAA NAAGAPAN BAGA AN X Kee EE X System Requirements cimil x Chapter 1 Getting Started iii EEn 1 E er ll EE 1 1 2 Methods of Collecting Performance Data stv cnn a puna Kuna palahian paaa paan 2 1 2 1 Instrumentation based ProflinG 3 anG DADAANAN cdta 2 1 2 2 Sample based Profiling nennen terne tenentes 3 1 3 Choose Profile Meho EE 4 1 4 Collect Performance Der ite te tee te de ain a e e nu uen les 4 14 1 Greis Be ue lU 4 1 4 2 Using System Environment Variables AEN 5 1 4 3 Profiling with Hardware Event Counterg AEN 5 1 5 Profiler Invocation and Jnttaltzaton AE 5 1 6 Application TUNNG aman KA NN a 5 BERT E 6 1 7 1 Prerequisite Java Virtual Machine EEN 6 1 12 SIOW NG WON T SEEEn 6 Chapter 2 Using PGP ROP eege 7 24 PGPROF Tabs and Icons Overviews s ariete pt eret aa 8 2 2 Pronie INAVIGANON sire eese item eA NAA a AA DIE I Aa 9 PES n o seo CCo ER 13 24 Sorting Profile Data iicet rerit EE Hn anne anaemia ECHO AG 13 2 5 Compiler Feedback ito ose bere mesa deb me e En e c a e Ea eee ee EDS 14 2 5 1 Special Feedback Messages AAA 15 2 6
3. sstk System Time Open nocancel mmul_drw read_nocancel Data_Transfer_to_Host mmul_kernel Data_Transfer_to_Device Device Number CUDA Kernel Call Count Grid Size Thread Block Size Avg Regs Thread 14 Avg Occupancy 1 000000 Coalesced Loads 163 840 Coalesced Stores 40 960 Avg Static Shmem Block 210 400 Parallelism Histogram Compiler Feedback 0 3077 EN 0385 0192 0192 0192 0000 0000 0000 76 10 5 5 5 0 0 0 System Configuration Profiled mm on Tue Mar 29 13 31 34 PDT 2011 Profile pgprof out 0126 EN 0048 fil 66 25 Sorted By Seconds Accelerator Performance Figure 15 CUDA Program Profile In Figure 15 The columns labeled Max CUDA GPU Secs and Max CUDA CPU Secs show times captured by the CUDA driver The Max Seconds column contains timings for host only code Pseudo function names Data Transfer To Host and Data Transfer To Device show the transfer times to and from the GPU The Accelerator Performance Tab shows counter values collected from the GPU PGI Profiler User Guide 28 Chapter 3 COMPILER OPTIONS FOR PROFILING This section describes the PGI compiler options that are used to control profiling and how they are interpreted 3 1 Mprof Syntax You can use the following compiler options to control data collection Most of these options are related to Mprof for which the sy
4. stat s tat no min no avg no max no proc no thread no a11 Set which process fields to display or not to display when using the arguments beginning with no PGI Profiler User Guide 54 Command Line Interface thread th read thread num Specify a thread for a multi threaded process profile times clines ew mec Specify whether time related values should be displayed as raw numbers or as percentages The default is pct history el Repeat previous command num Repeat previous command numbered num in the history list num Repeat the num th previous command numbered num in the history list string Repeat most recent command starting with string from the history list PGI Profiler User Guide 55 Chapter 9 PGCOLLECT REFERENCE The pgcollect command is a development tool used to collect performance data for analysis using the pgprof performance profiler This section describes how to use pgcollect 9 1 pgcollect Overview pgcollect runs the specified program with the supplied arguments While the program runs pgcollect gathers performance statistics When the program exits the data that is gathered is written to a file You can then use this file in the PGPROF performance profiler to analyze and tune the performance of the program The pgcollect command supports two distinct methods of performance data collection Time based sampling Creates a time based profile that co
5. 1T out Function Max Seconds mm fv update nonbon f_nonbon a next f torsion f angle tpac T bond a m serial a inactive f zero mp ecs T box fmth i dsin gh System Time fv update nonbon f hybrid mp_bcs OO O 00000000 oooooooonunntr u Sorted By Seconds S Process Thread Browser for application ammp Profile Seconds s2 pgprof 1T out 161 93 MI 100 gt Po 161 33 o j 100 S Process Thread Viewer for routine mm fv update nonbon Routine Seconds mm fv update nonbon 122 0 ii Po 122 55 Parallelism Histogram Compiler Feedback I System Configuration Accelerator Performance Profiled ammp on Wed Nov 03 14 55 08 PDT 2010 Profile pgprof 1T out Figure 2 PGPROF Initial View gt To zoom in to the line level for a particular routine double click the function name This action opens a tab that displays profiling data specific to the given function The tab label is the function name followed by an x icon You use the x icon to close the tab when you no longer want to view that information In this tab PGPROF displays the source code for that routine together with the performance data for each line For example if you double click on the function t PGPROF displays a new tab labelled ft that contains the source code for that function as illustrated in Figure 3 Because your program is probably opt
6. DATA_CACHE_MISSES l II DATA CACHE REFILLS FROM L7 OR NORTHBRIDGE th Sorted By CPU_CLK_UNHALTED Parallelism Histogram Compiler Feedback I System Configuration I Accelerator Performance Profiled ammp on Mon Nov 01 17 18 48 PDT 2010 for 6 242 seconds Profile pgprof 4T out Figure 7 Sort Example 2 5 Compiler Feedback The PGI compilers generate a special kind of information that is saved inside the executable file so that it is available to tools such as PGPROF to help with program analysis A compiler discovers a lot about a program during the build process Most compilers use such information for compilation and then discard it However when the Mprof or Minfo ccff options are used the PGI compilers save this information in the object and executable files using the Common Compiler Feedback Format or CCFF Feedback messages provide information about what the compiler did in optimizing the code as well as describe obstacles to optimization Most feedback messages have associated explanations or hints that explain what the message means in more detail Further these messages sometimes provide suggestions for improving the performance of the program O The information icon indicates that CCFF information is available In PGPROF you can access Compiler Feedback by clicking an information icon in the left margin of the Statistics Table This open
7. Figure 13 Figure 14 Figure 15 Figure 16 Figure 17 Figure 18 Figure 19 Figure 20 LIST OF FIGURES PGPROP OVGIVIOW M 8 PGPROF Initial Ven 10 Source COTS T 11 Assembly Level VIEW e itn b ee ies 12 View Navigation Buttons cit mr maa rta o rd rh ta a rta d t 12 HotSpot Navigation Controls AA 13 a ro im n nrbt dai 14 Multi Threaded Program Example inae aE nene 17 sample MPI uro 19 sample Scalability Comparison aaa dana t tee e test ida er eerte eeu haunt 20 Profile with Hardware Event Counter nennen tnter nennen 22 Accelerator Performance Data for Routine Level Profiling Example ANEN 24 Source Level Profiling for an Accelerator Region AEN 25 Source Level Profiling for an Accelerator Kernel nananana nana annnnawananaawannanawawaasasnanasanawsasannasanasansasansasans 26 CUDA Program Profile seen te tentent tenente tenente tenis 28 PGPROF User Interfaces a a R del dels 39 R I lte le 43 FOCUS Panel TAS L 46 Accelerator Performance tab of Focus Panel ANEN 48 CUDA Program Profile a KAG An terit eter a 50 PGI Profiler User Guide V LIST OF TABLES Table 1 PGPROF Icon Summary m 9 Table 2 MPI Profiling Options AEN 18 Table 3 PGPROF dE EE 51 PGI Profiler User Guide vi PREFACE This guide describes how to use the PGPROF profiler to tune serial and parallel applications built with The Portlan
8. Parallelism Histogram Compiler Feedback System Configuration Profiled mm on Tue Mar 29 13 31 34 PDT 2011 Profile pgprof out Figure 20 CUDA Program Profile PGI Profiler User Guide Accelerator Performance 50 Chapter 8 COMMAND LINE INTERFACE The command line interface CLI for non GUI versions of the PGPROF profiler is a simple command language This command language is available in the profiler through the text option The language is composed of commands and arguments separated by white space A pgprof gt prompt is issued unless input is being redirected This section describes PGPROF s command line interface providing both a summary and then more details about the commands 8 1 Command Description Syntax This section describes the profiler s command set Command names are printed in bold and may be abbreviated as indicated Arguments enclosed by brackets are optional Nk Separating two or more arguments by I indicates that any one is acceptable vY v v y Argument names in italics are chosen to indicate what kind of argument is expected v Argument names that are not in italics are keywords and should be entered as they appear 8 2 PGPROF Command Summary The Table 3 summarizes the commands for use in the CLI version of PGPROF providing the applicable arguments and a brief description of the use of each command The section that follows the table pro
9. Profiling Parallel Programs na rit cree ctt tte tito rette etre ne tlt eret e P e tie tu t NAA 15 2 6 1 Profiling Multithreaded Prograims ssscsccsscsessseecteesseesesseneeceseensecsssensessnesnseseatensessneeneeceateneeteneenseteaeensenens 16 26 2 Profiling MPII Programs vise aap t ertt tertie tard e e ed en Fed E tdv n sana 17 2 1 scalability COMPANSON EE 19 2 8 Profiling Resource Utilization with Hardware Event Counterg ANEN 20 2 8 1 Profiling with Hardware Event Counters Linux Only 21 2 8 2 Analyzing Event Counter Profiles AEN 21 2 9 Profiling GPU Programs ANEN 22 2 9 1 Profiling Open ACC PrOGramms x cick deett cocta ertt tin anda adie ove nee BADA ANAND 22 PGI Profiler User Guide ii 2 9 2 Profiling CUDA Fortran Program AEN 26 Chapter 3 Compiler Options for Prol haa aan oc ieri cie ir E inrer aiaa 29 SN ME ursi e Pe ee 29 3 2 Profiling Compilation Optons seen tnnt terne tete tenente tens 29 Chapter 4 Command Eine Options scri RERO EXER I EN Se r aaar aA ARENA NANANA NANG 31 4 1 Command Line Option Descriptions sse nennen tenete entrent teens 31 4 2 Profiler Invocation and Giatup nennen tentent tentent tenentes 32 Chapter 5 Environment Valiables gearen DEE 34 DA System Environment Variables ana itemm a 34 Chapter 6 Data atid Precisi n caida 35 SEM CCP 35 6 2 Profile Data as socio 35 6 3 Caveats Precision of Profiling Results 36 6 3 1 Accuracy of P
10. or All though not all of these choices are available for all columns gt Select Graph Colors This menu option opens a color chooser dialog box and a bar chart preview panel The preview panel contains the bar chart bar colors and the three bar chart attributes gt The bar chart bars can be gradient filled meaning that the color of the bar gradually transitions from the Bar Start Color to the Bar End Color To have solid colored bars without gradient fill which is the default simply set both of these colors to the same color The Filled Text Color attribute represents the text color inside the filled portion of the bar chart gt The Unfilled Text Color attribute represents the text color outside the filled portion of the bar chart The Background Color attribute represents the color of the unfilled portion of the bar chart The Reset button allows you to reset the selected bar chart or attribute to its previously selected color gt The OK button accepts your changes and closes the dialog box e Closing the dialog box is the same as choosing OK To modify a bar chart or attribute color 1 Click the radio button 2 Choose a color from the Swatches HSB or RGB pane 3 Click the OK button to accept the changes and close the dialog box PGI Profiler User Guide 41 PGPROF Reference PGPROF saves color selections for subsequent runs unless the Save Settings on Exit box is unchecked as describe
11. 0 1 event DRAM ACCESSES 500000 0x07 0 1 ed 13 Iyl Wied To use these events invoke pgcollect with the following arguments es function mem bw data 9 7 OpenACC and CUDA Fortran Profiling If you are profiling a program that uses the PGI Accelerator model or CUDA Fortran pgcollect automatically collects information for you PGI Profiler User Guide 60 pgcollect Reference 9 7 1 OpenACC Profiling pgcollect automatically collects and includes performance information for the PGI Accelerator model programs in the profile output for the program Inclusion of the accelerator performance information in the program s profile output occurs for both time based sampling and on Linux for event based sampling 9 7 2 CUDA Fortran Program Profiling If you are profiling a program that uses CUDA Fortran kernels running on a GPU pgcollect cuda collects performance data from CUDA enabled GPUs and includes it in the profile output for the program The syntax for this command option is cuda gmem branch cfg lt cfgpath gt ccl3 cc20 list The sub options modify the behavior of pgcollect cuda as described here branch Collect branching and warp statistics cc13 Use counters for compute capability 1 3 default ccnm Use counters for compute capability n m e TipUsepgcollect help to see which compute capabilities your system supports cfg lt cfgpath gt Specify lt cfgpath gt as CUDA profil
12. Guide 44 PGPROF Reference feedback information for the routine a round button containing the letter i is at the far left of the row Clicking that button populates the Compiler Feedback tab with the compiler feedback relating to that routine Line level View You access the line level view of a routine by clicking that routine s row in the routine level view PGPROF opens a new tab showing the line level information for the routine The tab label is the routine name and the tab contains an x which allows you to close the tab when you are done viewing the source code The Statistics Table in the new tab shows the source code for the selected function with performance data and Compiler Feedback buttons as with the routine level view Assembly level View You access the assembly level view of a source line or routine by clicking the assembly code icon the plus symbol on the Statistics Table row that corresponds to the row of interest in the line level view The table changes to show the assembly code interspersed with the source lines that were compiled to generate the code 7 4 2 Source Code Line Numbering In the optimization process the compiler may reorder the assembly instructions such that they can no longer be associated with a single line Therefore for optimized code a source line may actually be a code block consisting of multiple source lines This occurrence is common and expected and should not interfere with the
13. MN 100 309 656 000 100 223 456 000 MN 100 223 456 000 Mat Routine CPU_CLK_UNHALTED X DATA CACHE MISSES DATA CACHE REFILLS FRO DATA CACHE REFILLS FRO mm fv update nonbon 1 483 92 30 104 984 000 ff 21 90 176 000 MW 29 51 160 000 ff 23 21 8 4 5 4 104 984 000 ll 40 368 000 21 680 000 22 592 000 20 344 000 90 176 000 8l 294 34 824 000 11 19 352 000 6 19 592 000 6 16 408 000 5 51 160 000 M 23 20 064 000 9x 10 968 000 5 10 704 000 5 9 424 000 4 1 483 92 30 1 483 92 30 848 24 17 723 52 14 634 24 13 e Parallelism Histogram Compiler Feedback System Configuration Accelerator Performance Profiled ammp on Mon Nov 01 17 18 48 PDT 2010 for 6 242 seconds Profile pgprof 4T out Figure 8 Multi Threaded Program Example You can use thread specific data to determine how well balanced the application is Ideally each thread would spend exactly the same amount of time on a given part of the program If there are large disparities in the time spent by the various threads this points to a load imbalance where some threads are left idle while other threads are working In this case the resources of the system are not being used with 100 efficiency For example in the program illustrated in Figure 8 we can see that thread 0 spent 30 of the tim
14. Manual by Samuel P Harbison and Guy L Steele Jr Prentice Hall 1987 gt The Annotated C Reference Manual by Margaret Ellis and Bjarne Stroustrup AT T Bell Laboratories Inc Addison Wesley Publishing Co 1990 gt PGI User s Guide PGI Release Notes FAQ Tutorials http www pgroup com gt MPI CH http www unix mcs anl gov MPI mpich gt OpenMP http www openmp org System Requirements gt Linux or Windows For supported releases refer to http www pgroup com faq install htm gt Intel x86 and compatible AMD Athlon or AMD64 or Intel 64 or Core2 processor PGI Profiler User Guide X Chapter 1 GETTING STARTED This section describes the PGPROF profiler PGPROF provides a way to visualize and diagnose the performance of the components of your program Using tables and graphs PGPROF associates execution time with the source code and instructions of your program allowing you to see where and how execution time is spent Through resource utilization data processor counters and compiler feedback information PGPROF also provides features to help you understand why certain parts of your program have high execution times You can also use the PGPROF profiler to profile parallel programs including multiprocess MPI programs multi threaded programs such as OpenMP programs or a combination of both PGPROF provides views of the performance data for analysis of MPI communication multiprocess and multi thread load
15. System Time Routine Seconds gt gt System_Time 0 072 P 0 0 072 Parallelism Histogram I Compiler Feedback System Configuration I Accelerator Performance Profiled ammp on Wed Nov 03 14 56 56 PDT 2010 Profile pgprof 4T time out k Figure 10 Sample Scalability Comparison Each profile entry that has timing information has a Scale value The scale value measures how well these parts of the program scaled or improved their performance as a result of parallelism A scale value of zero indicates no change in the execution time between the two runs A scale value of one means that part of the program achieved perfect scalability For example if a routine had a Time value of 100 seconds with one thread and 25 seconds with four threads it would have a Scale value of one A negative value is the relative slowdown without taking the number of threads or processes into account If a routine takes 20 more time to execute using four threads than it took using one thread the Scale value is 0 2 A question mark in the Scale column indicates that PGPROF is unable to perform the scalability comparison for this profile entry For example scalability comparison may not be possible if the two profiles do not share the same executable or input data 2 8 Profiling Resource Utilization with Hardware Event Counters e Important Profiling with hardware counters is availabl
16. The current sort order such as sorting by the CPU Clock time is displayed at the bottom of the Statistics Table For example Sort Example shows the message Sort By CPU CLK UNHALTED at the bottom of the Statistics Table and the Histogram PGI Profiler User Guide 13 Using PGPROF E pgprof lea File Edit View Sort Help BEA lt gt Find 7 Hotspot CPU_CLK_UNHALTED IB y pgprof 4T out 0 Function Max CPU_CLK_UNHALTED Max DATA_CACHE_MISSES Max DATA_CACHE_REFILLS_FROM_ mp_barrierp 3 167 120 000 MN 36 109 336 000 EN 22 2 728 000 a mm fv update nonbon 1 483 920 000 E 174 104 984 000 ll 21 90 176 000 899090 anm serial 1 186 800 000 Bl 14 616 000 0 3 536 000 mp barrier 899 200 000 10 195 816 000 MN 40 125 856 000 MS tether 768 320 000 9 21 400 000 4X 45 264 000 il f nonbon 183 840 000 2 12 616 000 3 10 560 000 mp barrierw 150 320 000 2 7 392 000 2 504 000 O T torsion 84 720 000 1 4 816 000 1 2 984 000 CH read_eval_do 76 880 000 1 3 376 000 1 2 992 000 eval 76 160 000 1 3 360 000 1 2 760 000 strcmp 70 080 000 1 1 008 000 0 376 000 v nonbon 53 040 000 1 1 240 000 0 736 000 a next 45 040 000 1 968 000 0 1 008 000 CH tpac 39 520 000 0 1 152 000 0 1 144 000 f angle 37 760 000 0 1 280 000 0 1 256 000 5 4 D b Sorted By CPU CLK UNHALTED CPU_CLK_UNHALTED
17. as the data file merge m erge datafile PGI Profiler User Guide 53 Command Line Interface Merge the profile data from the named datafile into the current loaded dataset The datafile must be in standard pgprof out format and must have been generated by the same executable file as the original dataset no datafiles are modified process pro cess processor num For multi process profiles specify the processor number of the data to display print p rint gt filename Display the currently selected function data If the filename argument is present the output is placed in the named file The 5 means redirect output and is optional quit q uit Exit the profiler select exel exeu sede zaart tame woste Jee All J euro Display data for a selected subset of the functions This command is used to set the selection key and establish a cutoff percentage or value The cutoff value must be a positive integer and for time related fields is interpreted as a percentage The gt means greater than and is optional The default is all sort eese Dow bass ese mia proe isse calles covez timccall cins cost naine msgs mege senu megs recy byrces bytes senc owes reer vesies File Function level data is displayed as a sorted list This command establishes the basis for sorting The default is max time srcdir src dir directory Set the source file search path
18. controls include PGI Profiler User Guide 43 PGPROF Reference gt A drop down menu labeled HotSpot which you use to select the specific performance measurement of interest gt Three navigation buttons containing Forward and Back icons with associated plus and minus signs When the profile is first displayed the Statistics Table selects the row for the routine with the highest measured Time as though you had clicked on that row To navigate to the row with the next highest Time you click on the button labeled with the Forward icon and the minus sign denoting the next Time HotSpot lower than the current one Once you have navigated to this second HotSpot the Back HotSpot buttons are activated allowing you to navigate to the hottest HotSpot using the lt lt button or to the next higher Time using the lt button You can use the HotSpot drop down menu to change the measurement used to identify the HotSpots The default selection in the HotSpot menu is Time assuming that Time is one of the available measurements You can click on the down arrow in the drop down menu to select any other metric listed in the menu then click the Hottest button to navigate to the row showing the routine with the highest measured value for that metric 7 4 PGPROF Statistics Table This section describes the PGPROF Statistics Table The Statistics Table displays an overview of the performance data and correlates it with the associ
19. ds 51 6 5 Command Refererce ivi chc edema rena ie ee deci dore ge E CR naka cantare wees 52 Chapter 9 pgcollect Refererice Dees Eeer 56 9 1 pgcollect Overview nente tenente tnnt tenente tenente tente tentent te tente tentent tentent tenis 56 PAC Co cei reste essed ween seta nds sre nei tret deb AA 57 9 3 Build Tor Deele ege onde donde Ghana e tete ord ute te ter eec uten oth des 57 9 4 General Optons tentent tenente ente tenete tete teneret tete terne terne ennt 57 PGI Profiler User Guide iii 9 5 Time Based Profiling ette eet ed e re We E e Pe E 57 9 5 1 Time Based Profiling Options tentent rr 58 9 6 Event Based Profiling eie iere ii Dee shed ics it ila t i no ee dd 58 9 6 1 Root Privileges Requirement aaa nananana 58 9 62 Interrupted Profile Uns oe te i ere ee e ood Gianni pa IRE TIS 59 9 6 3 Event based Profiling Options AAA steed iva ita sr itia e e epe ei 59 9 6 4 Defining Custom Event Specifications seen tentent tentent 60 9 7 OpenACC and CUDA Fortran Profiling seen EE 60 9 01 OpenAGG Breng ete a rit Eee Pere e Hp Ee HU HER RES 61 93 2 CUDA Fortran Program Profilinig nmang NANG GAGA DANG ANGARA 61 9 1 3 Performance Tip eo BANANA NANANA HERE EU LANA HU vee ail Apan ie pe dt 61 PGI Profiler User Guide Figure 1 Figure 2 Figure 3 Figure 4 Figure 5 Figure 6 Figure 7 Figure 8 Figure 9 Figure 10 Figure 11 Figure 12
20. no dialog appears Further when the user selects the menu File Open Profile then the Text Field for Executable is set with lt execname gt in the dialog pgprof exe lt execname gt lt profilename gt PGPROF tries to open the profile lt profilename gt using lt execname gt for the executable name Further the GUI is populated according to profile data if valid PGI Profiler User Guide 33 Chapter 5 ENVIRONMENT VARIABLES This section describes the system environment variables that you can set to change the way profiling is performed 5 1 System Environment Variables As you learned in Basic Profiling a profiled program collects call counts and or time data When the program terminates a profile data file is generated Depending on the profiling method used this data file is called pgprof out or gmon out You can set the following system environment variables to change the way profiling is performed GMON ARCS Use this environment variable to set the maximum number of arcs caller callee pairs The default is 4096 This option only applies to gprof style profiling that is programs compiled with the pg option PGPROF DEPTH Use this environment variable to change the maximum routine call depth for PGPROF profiled programs The default is 4096 and is applied to programs compiled with any of the following options Mprof func Mprof lines or Mprof time PGPROF EVENTS Use this environmen
21. of each running process thread This data collection method employs a single timer that starts at zero 0 and is incremented at a fixed rate while the active program is being profiled For multiprocessor programs there is a timer on each processor and the profiler s summary data minimum maximum and per processor is based on each processor s time executing in a function How the timer is incremented and at what frequency depends on the target machine The timer is read from within the data collection functions and is used to accumulate COST and TIME values for each line function and the total execution time The line level data is based on source lines however in some cases there may be multiple statements on a line and the profiler shows data for each statement For instrumentation based profiling information provided for longer running functions are more accurate than for functions that only execute for a short time relative to the overhead of the individual timer calls Refer to Caveats Precision of Profiling Results for more information about profiler accuracy 6 2 Profile Data The following statistics are collected and may be displayed by the PGPROF profiler BYTES For MPI profiles only This is the number of message bytes sent and received PGI Profiler User Guide 35 Data and Precision BYTES RECEIVED For MPI profiles only This is the number of bytes received in a data transfer BYTES SENT For MPI profiles only This i
22. separator is platform dependent on Linux and Mac OS it is a colon and on Windows it is a semicolon Directories in the path are then searched in order from left to right When a directory with a filename that matches a source file is found that directory is used PGI Profiler User Guide 31 Command Line Options Here is an example for Linux and Mac OS In this example the profiler first looks for source files in the current directory then in the src directory followed by the STEPS directory EI oof Sees SIUNS Here is the same example for Windows SI oof PSS SUMS For more information see the Open Profile item in the description of the File Menu jarg arg1 arg2 argn Pass specified arguments separated by commas to java For example the following option passes the argument Xmx2 5 6m to java jarg Xmx256m This option is provided for troubleshooting purposes and is expected to rarely be used If you do use this option be certain not to forget the comma between the option and the first argument scale file s Compare scalability of datafile with one or more files A list of files may be specified by enclosing the list within quotes and separating each filename with a space For example scale one out two out This example compares the profiles one out and two out with datafile or pgprof out by default If only one file is specified quotes are not required For sample based prof
23. time and date of the profile run execution time number of processes if more than one and the datafile name PGI Profiler User Guide 38 PGPROF Reference The following sections describe each of these components in more detail Menus sz Max DATA CACHE MISSES 100K 308 656 000 100 223 496 000 P T Y Process Thread Viewer for routine mp barrier Routine CPUC DATA CACHE MESES DATA CACHE REFILLS FROM DATA CACHE REF we barrier 99 20 18x 195 816 000 MIN em 125 885 000 ax 125 8 vro 125 855 000 Ei 41X 125 8 1 899 2 i E 8 1 6x 19 9 T1 Tj T5 T tabs et Paratielism Histogram Profilect ammp on Mon Nov O Information bar Figure 16 PGPROF User Interface 7 2 PGPROF Menus PGPROF had the following menus File Edit View Sort and Help This section describes each menu in detail Keyboard shortcuts when available are listed next to menu items 7 2 1 File Menu The File menu contains the following items gt New Window control N Select this option to create a copy of the current profiler window on your screen gt Open Profile Select this option to begin analyzing a different profile When you see the dialog box fill in or browse to the information requested about the profile data file default pgprof out the executable file and the location of the source files When you click OK a new profile session is started using the information specif
24. 503 el 40x 3f 50 33 16 40 gt P3 o 26 2 501 25 3 20x i 17x 2 1 d gl 20x gt Po 0 0093 e 2 5 eM 40 3 9 P2 0 0092 25x 2 5014 25 3 20x 1 2 2 Pi 0 0091 sl 2 501 25 3 20x 1 2 X 16 0 0 8 On S Process Thread Viewer for routine main Routine Time v Count Messages Messages Sent Messages Recei Bytes es Sen Bytes Received 24x 0 T 50x 3 33x 321 16 40x rye 2 gt 24 7 2 16 20x 8 20x s 20 50 sl 20x ON sox 3 3 2 0x cl 40x 16 As 0 0 0 7 161 20x 20 sl 20x a 0 Parallelism Histogram 2 Compiler Feedback System Configuration Accelerator Performance Profiled cpi on Thu Nov 21 14 42 39 PST 2013 for 0 009554 seconds with 4 processes Profile pgprof out Figure 9 Sample MPI Profile 2 7 Scalability Comparison PGPROF provides a Scalability Comparison feature that measures changes in the program s performance between multiple executions of an application Generally this information is used to measure the performance of the program when it is run with a varying number of processes or threads To use scalability comparison first generate two or more profiles for a given application For best results compare profiles from the same application using the same input data with a different number of threads or processes Scalability is computed using the maximum time spent in each thread process Depending on how you profiled your pro
25. Mmpi mpich Mprof mpich1 This option has been deprecated It continues to direct the compiler to perform MPI profiling for MPICHI but only if you set the environment variable MPIDIR to the root of an MPICH1 installation Implies Mmpi mpich1 Mprof mpich2 This option has been deprecated It continues to direct the compiler to perform MPI profiling for MPICH2 but only if you set the environment variable MPIDIR to the root of an MPICH2 installation Implies Mnpi mpich2 Mprof msmpi Perform profiling for Microsoft MSMPI on Windows systems Implies option Mmpi msmpi Mprof mvapich1 This option has been deprecated It continues to direct the compiler to perform MPI profiling for MVAPICH1 but only if you set the environment variable MPIDIR to the root of an MVAPICHI installation Implies Mmpi mvapichl Mprof sgimpi Perform profiling for SGI MPI Implies option Mmpi sgimpi Li This option is required even if you compile and link using the SGI MPI mpi cc or mpi f 90 compiler wrappers Mprof time Linux Generate a profile using time based assembly level statistical sampling This is equivalent to using the pg option except the profile is saved in a file named pgprof out rather than in gmon out P Linux Enable gprof style sample based profiling Running an executable compiled with this option produces a gmon out profile file which contains routine line and assembly level profiling data PGI Profiler U
26. SI X3 159 1989 gt ISO IEC 9899 1999 Information technology Programming Languages C Geneva 1999 C99 gt HPDF Standard High Performance Debugging Forum http www ptools org hpdf draft intro html gt A Fortran 2003 Standard High Performance Debugging Forum http http www ptools org hpdf draft intro html Organization The PGPROF Profiler User s Guide contains ten sections that describe the PGPROF Profiler a tool for analyzing the performance characteristics of C C F77 and F95 programs Getting Started contains information on how to start using the profiler including a description of the profiling process information specific to certain how to profile MPI and OpenMP programs and how to profile with hardware event counters Using PGPROF describes how to use the PGPROF graphical user interface GUI Compiler Options for Profiling describes the compiler options available for profiling and how they are interpreted Command Line Options describes the PGPROF command line options used for profiling and provides sample invocations and startup commands Environment Variables contains information on environment variables that you can set to control the way profiling is performed in PGPROF PGI Profiler User Guide viii Preface Data and Precision contains descriptions of the profiling mechanisms that measure time how statistics are collected and the precision of the profiling results PGPROF R
27. ailable to you when you analyze your application s performance can be significantly enhanced if you compile and link your program using the Minfo ccff option This option saves information about the compilation of your program compiler feedback for use by PGPROF For more information on compiler feedback refer to For a more complete analysis our command execution might look similar to this S pgfortran fast Minfo ccff o myprog myprog 90 S pgcollect myprog S pgprof exe myprog 1 2 Methods of Collecting Performance Data PGI provides a number of methods for collecting performance data in addition to the basic pgcollect method described in the previous section Some of these have advantages or capabilities not found in the basic pgcollect method We divide these methods into two categories instrumentation based profiling and sample based profiling 1 2 1 Instrumentation based Profiling Instrumentation based profiling is one way to measure time spent executing the functions or source lines of your program The compiler inserts timer calls at key points in your program and does the bookkeeping necessary to track the execution time and execution counts for routines and source lines This method is available on all platforms on which PGI compilers are supported Instrumentation based profiling gt Provides exact call counts gt Provides exact line block execution counts gt Reports time attributable to only the code in a routine
28. ated source code or assembly instructions This is where you should start when analyzing performance data with PGPROF The Statistics Table displays information at up to three levels depending on the type of profile data collected how the program was built and whether the PGPROF source file search path has been set to include the program source directories 7 4 1 Performance Data Views The Statistics Table allows you to zoom in and out on the components of your program by providing several views the routine level view the line level view and the assembly level view gt The initial view when you invoke PGPROF is the routine level view gt To navigate to the line level from the routine level double click on the Statistics Table row corresponding to the function of interest If the program was built so thatit does not contain line location information then this action results in an assembly level display gt To navigate to the assembly code level from the line level click the assembly code icon the plus symbol on the Statistics Table row that corresponds to the source line of interest e You can use the View Select Columns menu option to select the data shown in the Statistics Table Routine level view The routine level view shows a list of the functions or subprograms in your application with the performance data for that routine in the same row of the table In addition if there is any compiler PGI Profiler User
29. balancing and scalability Using the Common Compiler Feedback Format CCFF PGI compilers save information about how your program was optimized or why a particular optimization was not made PGPROF can extract this information and associate it with source code and other performance data allowing you to view all of this information simultaneously Each performance profile depends on the resources of the system where it is run PGPROF provides a summary of the processor s and operating system s used by the application during any given performance experiment 1 1 Basic Profiling Performance profiling can be considered a two stage process gt In the first stage you collect performance data when your application runs using typical input gt In the second stage you analyze the performance data using PGPROF There are a variety of ways to collect performance data from your application For basic execution time profiling we recommend that you use the pgcollect tool which has several attributes that make it a good choice gt You don t have to recompile or relink your application gt Data collection overhead is low gt tis simple to use PGI Profiler User Guide 1 Getting Started gt It supports multi threaded programs gt It supports shared objects DLLs and dynamic libraries To profile your application named myprog you execute the following commands S pgcollect myprog S pgprof exe myprog The information av
30. ce 9 5 1 Time Based Profiling Options time Provide time based sampling only The sampling interval is 10 milliseconds This option is the default When using pgcollect for time based sampling you can have multiple instances of pgcollect running simultaneously but doing so is not recommended since this will probably skew your performance results 9 6 Event Based Profiling You can use the pgcollect command on linux86 64 to drive an OProfile session Event based profiling provides several predefined data collection options that gather data from commonly used counters For event based sampling the only required argument is the program or script which is either the filename of the program to be profiled or the name of a script that invokes the program Using a script can be useful if you want to produce an aggregated profile of several invocations of the program using different data sets In this situation use the exe option which allows the data collection phase to determine which program is being profiled When applicable you can provide arguments for the specified program or script Since OProfile provides only system wide profiling when you invoke pgcollect it provides a locking mechanism that allows only one invocation to be active at a time The pgcollect locking mechanism is external to OProfile and does not prevent other profile runs from invoking opcontro1 through other mechanisms 9 6 1 Root Privileges Requireme
31. ce tab shows a breakdown of timing statistics for the region and the accelerator kernels it contains A routine can contain more than one accelerator region Figure 13 shows an example of a source level view with an accelerator region directive selected In this illustration if you want to see the Seconds column you could scroll to the right in the Statistics Table PGI Profiler User Guide 24 Using PGPROF E paprof Co File Edit View Sort Help o aj lg s Find BH dh P HotSpot Seconds B QR R rofout mml 2 Max Accelerator Kernel Til 0000 0000 0000 0000 0008 Max Seconds 0 0000 0 0000 0 0000 0 0000 Max Accelerator Region Time 0 0000 0 0 0000 0 2207 MN 100 0000 0 0000 0 0000 0 0000 0 0000 0 0000 0 0000 0 0000 0 0000 0 0000 0 Source subroutine mmi a b c m a b c o real dimension 1 acc region do j 1 m do i 1 m a i j 0 0 enddo do k 1 m do i 1 m a i j a i J bCi k c k 3 enddo enddo enddo oooooooooooo oooooooooo oooooooooo Sorted By Line Device Number Accelerator Initialization Time secs Accelerator Kernels Time secs 0 124726 0 080791 Data Transfer Time Accelerator Region Maximum time spent Minimum time spent Average time spent secs Execution Count in Accelerator Region w o Init Time secs in Accelerator Region w o Init Time secs in Accelerator Region w o Init Ti
32. compiler may significantly reorder the assembly instructions used to implement the loop making it impractical to associate any given instruction with a line in the loop However it is possible to associate all of a loop s instructions with that loop so all of the performance data for the loop is associated with a single line For example in Figure 3 the information for the entire do loop at line 516 is associated with line 516 To zoom in to the assembly level for a particular source line click the plus symbol in the row of the Statistics Table containing that source line PGPROF displays the routine with assembly code interspersed with the source lines with which that assembly code is associated as Figure 4 illustrates the for loop at line 510 PGPROF displays performance data associated with a specific assembly instruction in the row of the Statistics Table containing that instruction PGI Profiler User Guide 11 Using PGPROF P A Greet LS File Edit View Sort Help D BA lt gt Fine Hotspot Seconds Y Line a Source Max Seconds 509 nxp2 nxp 2 0 000 0 00 510 do 20 m 1 nxp2 0 000 0 510 Dx404487 F3 F 10 D 31 6B 25 0 movss 0x256B31 rip xmml 0 000 0 510 Px40488F 85 CO testl eax eax 0 020 0 510 0x404491 F3 41 F 100 mowss Gr n Bn 0 000 0 510 Px404496 F3 41 F 5C 484 subss 4 r8 xmm1 0 020 0 510 Dx404A8C 7E 72 jle Ox72 lt 0x404B10 gt 0 000 0 510 Ox404A9E 89 DF mo
33. current instruction address program counter is read and tracked at statistically significant intervals Instruction addresses where a lot of time is spent during execution are read numerous times The profiler can map these addresses to source lines and or functions in your program providing an easy way to navigate from the function where the most time is spent to the line or to the assembly instruction You can build your program using the Mprof time compiler option for time based sampling of single threaded Linux programs When using Mprof time you are required only to re link your program However unless you compile with Minfo ccff compiler feedback will not be available As described previously in Basic Profiling we recommend using pgcollect for time based profiling Event based Sampling As well as reading the program s instruction address event based sampling uses various methods to read and track the values of selected hardware counters These counters track processor events such as data cache misses and floating point operations You can use this information to help determine not just that time is being spent in a particular block of code but why so much time is spent there If there is a bottleneck related to a particular resource such as the level two data cache these counters can help you discover where the bottleneck is occurring Event based sampling requires that a performance tool named OProfile be co installed with
34. d Group PGI Fortran C and C compilers for X86 AMD64 and Intel 64 processor based systems It contains information about how to use the PGI profiling tools as well as detailed reference information on commands and graphical interfaces Intended Audience This guide is intended for application programmers scientists and engineers proficient in programming with the Fortran C and or C languages The PGI tools are available on a variety of operating systems for the X86 AMD64 and Intel 64 hardware platforms This guide assumes familiarity with basic operating system usage Supplementary Documentation See http www pgroup com docs htm for the PGPROF documentation updates Documentation delivered with PGPROF should be accessible on an installed system by accessing docs index htm in the PGI installation directory Typically the value of the environment variable PGI is set to the PGI installation directory See http www pgroup com faq index htm for frequently asked PGPROF questions and answers Compatibility and Conformance to Standards The PGI compilers and tools run on a variety of systems They produce and or process code that conforms to the ANSI standards for FORTRAN 77 Fortran 95 C and C and includes extensions from MIL STD 1753 VAX VMS Fortran IBM VS Fortran SGI Fortran Cray Fortran and K amp R C PGF77 PGF90 PGCC ANSI C and PGCPP support parallelization extensions based on the OpenMP defacto standard PGHPF support
35. d later in this section Font This menu option opens the Fonts dialog box You can change the font and or font size using this dialog s drop down lists As you change the font you can preview the changes in the Sample Text pane To change the font you must click the OK button e Tip Click Cancel or close the dialog box to abort any changes Show Tool Tips Select this check box to enable tool tips Tool tips are small temporary messages that pop up when the mouse pointer is positioned over a component such as a button in the user interface Tool tips provide a summary or hint about what a particular component does Clear this check box to turn tool tips off 7 2 4 Sort Menu Use the Sort menu to change the metric used to sort profile entries The current sort order is displayed at the bottom of the Statistics Table and the Histogram tab The default sorting metric is time for function level profiling and source line number for line level profiling The sort is performed in descending order from highest to lowest value except when sorting by filename function name or line number Filename function name and line number sorting is performed in ascending order lowest to highest value Sorting is explained in greater detail in Sorting Profile Data 7 2 5 Help Menu The Help menu contains the following items gt PGPROF Help This option invokes PGPROF s integrated help utility The help utility includes an HTML
36. do j 0000 0000 0 0000 0000 0000 0 0008 0 0000 0 0000 0 0000 0 K 094 0000 0 0000 0 0000 0 0000 E Sorted By Line a 1 3 aCi 3 b i k c k 3 enddo enddo enddo 099990909900000 ooooooooooooo ooooooooooooo H Device Number Accelerator Kernel Execution Count Grid Size 63x63 Block Size 16x16 Maximum time spent in Accelerator Kernel secs 0 020006 Minimum time spent in Accelerator Kernel secs 0 019964 Average time spent in Accelerator Kernel secs 0 019988 Parallelism I Histogram LO Compiler Feedback I System Configuration Accelerator Performance Profiled mm2 on Tue Mar 29 12 34 07 PDT 2011 Profile pgprof out K Figure 14 Source Level Profiling for an Accelerator Kernel For more information on tuning accelerator programs refer to the Using an Accelerator section of the PGI Compiler s User s Guide 2 9 2 Profiling CUDA Fortran Programs For CUDA Fortran pgcollect provides an filepath cuda that enables collection of performance data on the CUDA device Analysis of this performance data is much the same as for OpenAcc programs as described in the previous section except that the data is collected from counters on the device and in the CUDA driver If you are profiling a program that uses CUDA Fortran kernels running on a GPU pgcollect cuda collects performance data from the CUDA enabled GPU and includes it in the profile out
37. e config file gmem Collect global memory access statistics list List CUDA event names available for use in profile config file 9 7 3 Performance Tip On some Linux systems initialization of the CUDA driver for accelerator hardware that is in a power save state can take a significant amount of time You can avoid this delay in one of these ways gt Run the pgcudainit program in the background which keeps the GPU powered on and significantly reduces initialization time for subsequent programs For more information on this approach refer to the Using an Accelerator section of the PGI Compiler User s Guide PGI Profiler User Guide 61 pgcollect Reference gt Use the pgcollect option cudainit to eliminate much of the initialization overhead and to provide a more accurate profile pgcollect time cudainit myaccelprog In release 10 5 the option cudainit was called accinit These two options have exactly the same functionality PGI Profiler User Guide 62 Notice ALL NVIDIA DESIGN SPECIFICATIONS REFERENCE BOARDS FILES DRAWINGS DIAGNOSTICS LISTS AND OTHER DOCUMENTS TOGETHER AND SEPARATELY MATERIALS ARE BEING PROVIDED AS IS NVIDIA MAKES NO WARRANTIES EXPRESSED IMPLIED STATUTORY OR OTHERWISE WITH RESPECT TO THE MATERIALS AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE Information furnished is believed to be accurate and reliable Ho
38. e in the routine while thread 3 spent only 13 of the time there Performance might improve if the work could be distributed more evenly 2 6 2 Profiling MPI Programs To create and view a performance profile of your MPI application you must first build an instrumented version of the application using the Mprof option Some MPI distributions are supported directly in the compilers via Mprof sub options In these cases the MPI profiling options cannot be used alone They must be used in concert with another sub option of Mprof such as Lines func or time Other MPI distributions require compilation with MPI compiler wrappers The following table summarizes the options required for profiling with different MPI distributions PGI Profiler User Guide 17 Using PGPROF Table2 MPI Profiling Options Deprecated Mprof mpichl func li ti Deprecated Mprof mpich2 func 1i t MVAPICH2 Use MVAPICH2 compiler wrappers profile profcc proffer E Mprof func lines time MS MPI Mprof msmpi func lines Open MPI Use Open MPI compiler wrappers Mprof func lines time SGI MPI Mprof sgimpi func lines time For more details about how to compile an MPI program for profiling refer to the Using MPI section of the PGI Compiler User s Guide Once you have built an instrumented version of your MPI application running it as you normally would produces the MPI profile data On succes
39. e only on Linux PGI Profiler User Guide 20 Using PGPROF Modern x86 and x64 processors provide low level hardware counters that can be used to track the resource utilization of a program Tracking this information can be useful in tuning program performance because it allows you to go beyond just knowing where the program is spending the most time and examine why it is spending time there Linux systems do not provide hardware counter support by default These systems must have the OProfile package installed 2 8 1 Profiling with Hardware Event Counters Linux Only PGPROF supports hardware counter data collection through the execution of the program under the control of the pgcollect command Collection of profile data using pgcollect may be done on any linux86 or linux86 64 system where Oprofile is installed OProfile is included as an install time option with most Linux distributions it may also be downloaded from http oprofile sourceforge net No special build options are required to enable event based profiling with pgcollect although building with the option Minfo ccff may provide useful compiler feedback For specific information on using PGPROF with hardware event counters refer to pgcollect Reference 2 8 2 Analyzing Event Counter Profiles If you executed your program under the control of pgcollect then you can profile up to four event counters and view them in PGPROF For brief descriptions of what each hardware c
40. e the majority of time is spent build with Mprof func lt mpi gt Then build that portion of the program with Mprof lines lt mpi gt to isolate the performance problem On Linux86 64 platforms on which OProfile is installed once you have collected a time based profile using either instrumentation or sample based profiling consider further examining the resource utilization of those portions of code where the most time is spent You do this with event based sampling using the pgcollect command with event based sampling options as described in pgcollect Reference 1 4 Collect Performance Data To obtain the performance data required for PGPROF you must run your program gt If you use any method other than the pgcollect command to collect data run your program normally using a representative input data set If you use the pgcollect command to collect data refer to Basic Profiling for information on how to execute a profiling run of your program For specific details on pgcollect refer to pgcollect Reference 1 4 1 Profiling Output File In all profiling methods once the program s profiling run is complete a file named pgprof out is written to the program s working directory This file contains the performance data used by PGPROF to analyze the program s performance PGI Profiler User Guide 4 Getting Started 1 4 2 Using System Environment Variables You can use system environment variables to change the way profilin
41. efer to Profiler Invocation and Initialization For information about using the PGPROF text based command line interface refer to Compiler Options for Profiling For information about how to choose a profiling method build your program and execute it to collect profile data refer to Getting Started 7 1 PGPROF User Interface Overview On startup PGPROF attempts to load the profile datafile specified on the command line or the default pgprof out If no file is found a file chooser dialog box is displayed Choose a profile datafile from the list or select Cancel When a profile datafile is opened PGPROF populates the user interface as illustrated and labeled in Figure 16 Menu Bar Contains these menus File Edit View Sort and Help Toolbar Provides navigation shortcuts and controls for frequently performed operations Statistics Table Displays profile summary information for each profile entry Information can be displayed at up to three levels routine line or assembly depending on the type of profile data collected how the program was built and whether the PGPROF source file search path has been set to include the program source directories The initial view is the routine level view Focus Panel Consists of tabbed panes labeled Parallelism Histogram Compiler Feedback System Configuration and Accelerator Performance Information Bar Displays the profile summary information such as the name of the executable the
42. eference provides reference information about the PGPROF graphical user interface including information about the menus the toolbars and the subwindows Command Line Interface provides information about the PGPROF profiler command line interface language providing both a summary table and details about the commands The table includes the command name the arguments for the command and a brief description of the command all separated by area of use pgcollect Reference provides reference information about the pgcollect command It describes the PGPROF command line options and how to use them to configure and control collection of application performance data Conventions This guide uses the following conventions italic is used for emphasis Constant Width is used for filenames directories arguments options examples and for language statements in the text including assembly language statements Bold is used for commands item1 in general square brackets indicate optional items In this case item1 is optional In the context of p t sets square brackets are required to specify a p t set item2 item 3 braces indicate that a selection is required In this case you must select either item2 or item3 filename ellipsis indicate a repetition Zero or more of the preceding item may occur In this example multiple filenames are allowed FORTRAN Fortran language statements are shown in the text of this guide usin
43. erformance Data iii a t ed God nen d e E E da 36 6 3 2 Clock Granularly sec inito maa pin eim ate eed einem edad rq bein eq da a d er d n 36 0 3 3 Source Code Correlation tte t er teo ae o t EP M RE ENTER 37 Chapter 7 PGPROF Reference rte tenete tacet tefta il ataca 38 7 1 PGPROF User Interface Cverndem eene tentent tenete tenente tenens 38 L2 POPROF Menus intr irte te pacto n detti n mania deni a e p ed eo Dude a 39 121 Fille Merl ansia hee Mu ese i wap diua MER E De ic epa iR CER 39 12 2 Edit MENU HP 40 PL ME NANGANAK KAN NG BANSANG KLANG aan animes 41 1 24 Soft Mente mann o nian nuances aden iin nanan 42 T25 Help MON E 42 TS PGPROE Toda mamana retento aha Ad 43 TA PGPROF Statistics Table a na nanan o eme e n m E tono dao REF ua 44 TAA Performance Data VIEWS rr tor tet A te AN 44 4 2 Source Code Line NUM MO tii ao 45 0 PORRO OCUPE DI Nao Bio saan 45 FON Parallelism H E 46 To Le SOO DEE 46 7 5 3 Compiler Feedback tab tnter tenerent te terree te tenerent teen 46 TA System Configuration tab cti Kaban ma nG e c Pr ti qb E d ee t el t deb e eR blu 47 7 5 5 Accelerator Performance tab essent tette entente tte tenente tenens 4T Chapter 8 Command Line Interface die e rer EIE iaa 51 8 1 Command Description SYMAX s cou ict teles aia ipee eret ree rc d tr p e ere tle eed rcd redo 51 8 2 POPROF Command SUMMARY knna teg ente a rab tee cae ete cu e eb ca d te
44. es processor frequency CUDA driver version and NVRM version The Program s OS Target is the operating system platform that the executable was built for Although the processor may be a 64 bit processor the executable may target a 32 bit platform Tip If you need further explanations for any of these items refer to vendor processor documentation Accelerator tab Contains information about the GPU s that are on the system on which the profile was run If there is no GPU on this system the Accelerator tab is empty For each GPU also known as a device the Accelerator tab can include information such as the device name device revision number global memory set number of multiprocessors number of cores concurrent copy and execution total constant memory total shared memory per block registers per block warp size maximum threads per block maximum block dimensions maximum grid dimensions maximum memory pitch texture alignment and clock rate e Tip If you need further explanations for any of these items refer to vendor GPU documentation 7 5 5 Accelerator Performance tab This tab displays profiling information collected by pgcollect on for programs built using CUDA Fortran or the PGI Accelerator Model For more information on pgcollect refer to pgcollect Reference OpenACC Profiles The profiling information is relative either to an Accelerator Region or to an Accelerator Kernel PGI Profiler User Guide 47 PGPROF Refe
45. file CPU CLK UNHALTED DATA CACHE MISSES DATA CACHE REFILLS FRO Sorted By CPU CLK UNHALTED DATA CACHE REFILLS FRO paprof 4T out 4 993 60 BB 100 4 993 60 EN 100 CPU CLK UNHALTED M 491 680 000 iwan 100 491 680 000 MN 100 DATA CACHE MISSES 309 656 000 NN 100 309 656 000 ilag 100 DATA CACHE REFILLS FRO 223 456 000 100 223 456 000 100 DATA_CACHE_REFILLS_FRO go9 20 M 18 e99 20 M 18 899 20 18 830 72 17 723 28 14 58 40 1 195 816 000 MN 195 316 000 Ill 89 920 000 83 072 000 16 984 000 5 840 000 40 40 18 17 3 1 125 856 000 MN 41 125 856 000 MN 41 19 960 000 6 83 072 000 MW 27 16 984 000 5 5 840 000 2 125 856 000 MN 56 125 856 000 MY 5696 19 960 000 9 83 072 000 EN 37 16 984 000 8 5 840 000 3 e Parallelism Histogram Compiler Feedback System Configuration Accelerator Performance Profiled ammp on Mon Nov 01 17 18 48 PDT 2010 for 6 242 seconds Profile pgprof 4T out Figure 11 Profile with Hardware Event Counter 2 9 Profiling GPU Programs You can use PGPROF to analyze the performance of GPU programs GPU performance data is included in the profile pgprof out when a GPU program is run using pgcollect PGI provides two methods of programming GPUs OpenACC which uses programs and directives to tell the compi
46. g a reduced fixed point size C C C C language statements are shown in the test of this guide using a reduced fixed point size The PGI compilers and tools are supported on both 32 bit and 64 bit variants of the Linux OS X and Windows operating systems on a variety of x86 compatible processors There are a wide variety of releases and distributions of each of these types of operating systems PGI Profiler User Guide ix Preface Terminology If there are terms in this guide with which you are unfamiliar PGI provides a glossary of terms which you can access at http www pgroup com support definitions htm Related Publications The following documents contain additional information related to the X86 architecture and the compilers and tools available from The Portland Group gt PGI Fortran Reference Manual describes the FORTRAN 77 Fortran 90 95 and HPF statements data types input output format specifiers and additional reference material related to the use of PGI Fortran compilers gt System V Application Binary Interface Processor Supplement by AT T UNIX System Laboratories Inc Prentice Hall Inc gt FORTRAN 95 HANDBOOK Complete ANSI ISO Reference The MIT Press 1997 gt Programming in VAX Fortran Version 4 0 Digital Equipment Corporation September 1984 gt IBM VS Fortran IBM Corporation Rev GC26 4119 gt The C Programming Language by Kernighan and Ritchie Prentice Hall gt C A Reference
47. g is performed For more information on these variables refer to Environment Variables 1 4 3 Profiling with Hardware Event Counters You can also profile using hardware event counters For more specific information on this type of profiling refer to Profiling Resource Utilization with Hardware Event Counters 1 5 Profiler Invocation and Initialization PGPROF is invoked as follows 9 pgprof exe options datafile If invoked without any options or arguments PGPROF attempts to open a data file named pgprof out and assumes that application source files are in the current directory The program executable name specified when the program was run is usually stored in the profile data file If all program related activity occurs in a single directory PGPROF needs no options Probably the most common way to invoke the profiler is this o pgprof exe lt execname gt When you use this command to launch PGPROF gt Ifapgprof out file exists in the current directory PGPROF opens it and uses lt execname gt to display the profile data gt fnopgprof out file exists in the current directory no profile data is displayed However when the user selects the menu File Open Profile the Text Field for Executable is set with lt execname gt in the dialog For information on all available profiler options and how they are interpreted refer to Compiler Options for Profiling For information on the command line options f
48. gram this measurement may be displayed in the Statistics Table in a column with one of these heading titles Time if you used Mprof func Mprof lines or Mprof time CPU CLK UNHALTED if you used pgcollect Important Profiling multi process MPI programs with the pgco11ect command is not supported The number of processes and or threads used in each execution can be different After generating two or more profiles load one of them into PG PROF Select the Scalability Comparison item under the File menu described in File Menu or click the Scalability Analysis button in the Toolbar Choose a second profile for comparison A new instance of PGPROF appears with a column named Scale in the Statistics Table PGI Profiler User Guide 19 Using PGPROF Figure 10 shows the profile of a run that used four threads with Scalability Comparison to the same program run with a single thread r pgprof S File Edit View Sort Help ads Find zl db op Hotspot Seconds CF pgprof 4T time out 0 6600060606000 O S Process Thread Browser for application ammp Profile Seconds Function Max Seconds mp barrier 40 31 07 mm fv update nonbon 39 17 mp barrierp 20 950 f nonbon V 5 37 a next 3 81 f_torsion f angle tpac T bond T box a inactive f zero a m serial f hybrid pgprof 4T time out 100 Po 93 355 0 0 0 08 100 Y Process Thread Viewer for routine
49. ices of your arrays If so you may want to override the compiler to copyin copyout the entire array PGI Profiler User Guide 23 Using PGPROF r pgprof LE eeng File Edit View Sort Help BSA lt Find amp Hotspot Seconds ERES pgprof out Function Max Seconds Max Accelerator Region Time Max Accelerator Kernel Time sstk 1667 MEM 54 0 0000 0 0 0000 0 System Time 0476M 15 0 0000 0 0 0000 0 Open nocance 0238 8 0 0000 0 0 0000 0 LunTock 1354 0238l 8 0 0000 0 0 0000 0 fchmodat 0238 8 0 0000 0 0 0000 0 L Tock 106 0238 8 0 0000 0 0 0000 0 mmi 0000 0 0 3066 BENI 100 0 0872 MN 100 Sorted By Seconds Device Number 0 Accelerator Initialization Time secs 0 204147 Accelerator Kernels Time secs 0 087231 Data Transfer Time secs 0 015217 Parallelism Histogram Compiler Feedback System Configuration Accelerator Performance Profiled mm2 on Wed Nov 03 15 20 17 PDT 2010 Profile pgprof out Figure 12 Accelerator Performance Data for Routine Level Profiling Example For more information on compiler feedback refer to Compiler Feedback Region Level Analysis As with host only profiles you can drill down to the source code level by double clicking on the routine name in the Function column For an accelerator program the display centers on the accelerator region directive for the longest executing region The Accelerator Performan
50. ied in the dialog box If the Source Path is the only parameter that is changed from current session parameters then the current session uses the new Source Path to search for sources PGI Profiler User Guide 39 PGPROF Reference gt Set Source Directory Select this option to add or remove a directory in the source file search path gt Scalability Comparison Select this option to open another profile for scalability comparison As you did for the Open Profile option described above provide information about the profile data file the executable file and the location of the source files Notice that the new profile contains a Scale column in its Statistics table Another method to open profiles for scalability comparison is by using the scale command line option explained in Profiler Invocation and Initialization For more information on scalability refer to Scalability Comparison gt Print Select this option to make a hard copy of the current profile data The profiler processes data from the Statistics table and sends the output to a printer A printer dialog box appears You can select a printer using the Name drop down list under Print Service Alternately click the Print To File check box to send the output to a file Other print options may be available however they are dependent on the specific printer and the Java Runtime Environment JRE gt Print to File Option output is not sent to p
51. igation HotSpot navigation Sorting profile data Compiler Feedback Profiling parallel programs including multi threaded and MPI programs Scalability comparison Profiling resource utilization with hardware event counters Profiling accelerator programs Y vV ZZ WV voy Vw PGI Profiler User Guide 7 Using PGPROF gt Max DATA CACHE MISSES KA 109 335 000 17 104 E Statistics Table ox v CACHE C DATA CACHE REFILLS FP KR Gs r gem KG o 22 486 000 MN 100 a Focus DATA CACHE MESES DATA CACHE REFILLS FROM DATA CACHE REF panel p barrier 18x 195 816 000 MN SA 125 856 000 vro 16X 195 816 000 89808 40x 125 855 000 i8xX 9 9 Profilet ammp on Mon ov O Information bar Figure 1 PGPROF Overview 2 1 PGPROF Tabs and Icons Overview Before we describe how to navigate within PGPROF it is useful to have some common terminology for the tabs and icons that you see within the application Closeable and Non closeable Tabs PGPROF displays both closeable and non closeable tabs For example when you first invoke PGPROF you see the function level statistics table in a panel with a non closeable tab Then to access profiling data specific to a given function you double click on the function name and a closeable tab opens with source code and profiling statistics for that function This closeable tab navigation approach provides a way for you to ea
52. iles e g gmon out specified with this option PGPROF assumes that all profile data was generated by the same executable For information on how to specify multiple executables in a sample based scalability comparison see the Scalability Comparison item in the description of the File Menu text Use the PGPROF Command Line Interface CLI V Print version information 4 2 Profiler Invocation and Startup Let s take a look at some common ways to invoke the profiler describing what each launch command means pgprof gt Ifapgprof out file exists in the current directory PGPROF tries to open it gt If an executable name can be determined from the pgprof out file the GUI is populated according to profile data if valid gt If an executable name can NOT be determined from the pgprof out file then a dialog is opened on top of the main window with the following message Can t determin xecutable for file pgprof out Please use File Open Profile menu to specify one gt fnopgprof out file exists in the current directory the GUI is not populated and no dialog appears PGI Profiler User Guide 32 Command Line Options pgprof exe lt execname gt gt Ifapgprof out file exists in the current directory PGPROF tries to open it and use execname Further the GUI is populated according to profile data if valid gt fnopgprof out file exists in the current directory the GUI is not populated and
53. imized you may notice that performance data is only shown for a subset of the source lines For example a multi line loop may only have line level data for the first line of the loop PGI Profiler User Guide 10 Using PGPROF r pgprof lola ea File Edit View Sort Help BSA Eei Find sl db op HotSpot Seconds Y fft Line a Source Max Seconds 509 nxp2 nxp 2 0 000 0 DO 510 do 20 m 1 nxp2 0 082 1 511 if inverse then 0 000 0 512 wk conja wcn 0 000 0 513 else 0 000 0 514 wk w n 0 000 0 515 end if 0 000 0 DO 516 do 10 mxp nxp nx nxp 8 439 NN 56 517 jl mxp nxp m 0 000 0 518 j2 ji nxp2 0 000 0 519 t x ji x j2 0 000 0 520 xD x j1 x j2 0 000 0 521 x 32 t wk 0 000 0 522 c print it 1t m m mxp mxp jl j1 j2 32 0 000 0 10 523 10 continue 0 969 6 524 n n itab it 0 000 0 Sorted By Line S Process Thread Browser for routine Tt Routine Seconds fft 16 653 M0 27 gt Po 16 653 S Process Thread Viewer for line 501 Line Seconds 501 0 020 0 gt Po 0 020 0 e Parallelism Histogram Compiler Feedback System Configuration Accelerator Performance Profiled fftpde on Mon Nov 01 17 46 50 PDT 2010 Profile pgprof out L d Figure 3 Source Code View In the optimization process the
54. ine accelerator performance data using PGPROF including function level analysis region level analysis and kernel level analysis A comprehensive guide to tuning accelerator programs is beyond the scope of this manual Function Level Analysis When you invoke PGPROF on the profile of an accelerator program the initial view displays a function list showing host times in the Seconds column and accelerator times in the Accelerator Region Time column and Accelerator Kernel Time column Figure 12 illustrates a routine level view with the routine jacobi selected and the Accelerator Performance tab chosen in the Focus Panel One of the first things to look at in tuning an accelerator program is whether the Data Transfer Time is large relative to the Accelerator Kernels Time In the example illustrated in Figure 12 the Accelerator Kernels Time of 4 134521 seconds is much larger than the Data Transfer Time of 0 132602 seconds so we have efficient use of the accelerator If data transfer time is taking a significant portion of the total time you would want to investigate if transfer time could be reduced using data regions described in Section 7 Using an Accelerator of the PGI User s Guide If data transfer time is relatively high and you have already considered data regions you might want to examine the Compiler Feedback You must compile with Minfo ccff to be able to do this Check if the compiler is generating copyin copyout operations that use sl
55. information about these events refer to the opcontrol man page x64 processors provide numerous event counters that measure the usage of a variety of processor resources Not all processors support the same set of counters To see which counters are supported on a given system use the following command pgcollect list events The output of this command also provides information on event masks the hex value in the event specification and minimum overflow values Here are two examples of shell functions providing event specifications to pgcollect These functions would be implemented in a pgoprun file Custom Event Example 1 This function specifies the events needed to calculate cycles per instruction CPU CLK UNHALTED RETIRED INSTRUCTIONS The fewer cycles used per instruction the more efficient a program is Goi data 0 Y event S event event S event event CPU CLK UNHALTED 500000 0x00 0 1 event RETIRED INSTRUCTIONS 500000 0x00 0 1 To use these events invoke pgcollect with the following arguments es function cpi data Custom Event Example 2 This function specifies events needed to determine memory bandwidth mem bw data event event event event event event event event event CPU CLK UNHALTED 500000 0x00 0 1 event SYSTEM READ RESPONSES 500000 0x07 0 1 event QUADWORD WRITE TRANSFERS 500000 0x00
56. is button is the same as selecting File Scalability Comparison from the menu bar gt Forward and Back buttons click these buttons to navigate forward and back to previous and subsequent views respectively Use the down arrow to display the full list of views and to select a view to jump to These lists use a notation to describe the profile views as follows profile data file8source file routine line address The address field is omitted for line level views and both the line and address fields are omitted for routine level views For example the following item in a list would describe a view that uses profile data from pgprof out and is displaying line 370 in the routine named solver in source file main f pgprof out main f solver 370 gt Search controls use these to locate information The controls include gt A text box labeled Find Entering a search string here and hitting Enter is the same as using the dialog box invoked from the Edit Search Forward menu bar item gt Two buttons labeled with down and up arrows respectively These buttons provide Search Next and Search Previous operations similar to Edit Search Again Search Next searches for the next occurrence of the last search string below the current location and Search Previous searches for the next occurrence above gt HotSpot Navigation controls use these to navigate to the most significant measurements taken in the profiling run The
57. ization using Mpfi Mpfo in the Optimizing and Parallelizing section of the PGI Compiler User s Guide Profile guided optimization can often determine loop counts and other information needed to calculate the Compute Intensity for a given statement or loop Messages for Inlined Routines Inlined functions are identified by CCFF messages These Compiler Feedback messages for routines that have been inlined are associated with the source line where the routine is called Further these messages are prefixed with the routine and line number and are indented to show the level of inlining Currently there is not a way to view the source code of that inlined instance of the routine 2 6 Profiling Parallel Programs You can use PGPROF to analyze the performance of parallel programs including multi threaded and OpenMP programs multi process MPI programs and programs that are a combination of the two PGPROF also provides a Scalability Analysis feature that allows you to compare two profiling runs and thus determine how well different parts of your program scale as the number of threads or processes changes PGI Profiler User Guide 15 Using PGPROF 2 6 1 Profiling Multi threaded Programs Multi threaded programs that you can profile using PGPROF include OpenMP programs built with mp auto parallelized programs built with Mconcur and programs that use native thread libraries such as pthreads Collecting Data from Multi Threaded Progra
58. ler how to generate GPU code and CUDA Fortran which is used to program the GPU more directly The next section describes how to use pgcollect with OpenACC programs and the subsequent section describes using it with CUDA Fortran programs 2 9 1 Profiling OpenACC Programs For OpenACC the profiling procedure is the same as for host only programs except that PGPROF provides an Accelerator Performance tab that allows you to review profiling information provided by the accelerator You do not need to build or run with any special options to collect accelerator performance data PGI Profiler User Guide 22 Using PGPROF Here is an example of the commands you might use in a simple accelerator profiling session pgfortran ta nvidia o myprog myprog f90 pgcollect time myprog pgprof exe myprog You can build your program to print GPU performance data to standard output by using the time suboption to the target accelerator option ta For example you can use this command pgfortran ta nvidia time myprog f90 The t ime suboption has no effect on pgcollect or PGPROF profiling For more information on using PGI compilers to build programs for accelerators and on related terminology refer to Section 7 Using an Accelerator of the PGI Compiler User s Guide For more information on pgcollect refer to pgcollect Reference Analyzing Accelerator Performance Data This section provides a basic description of how to exam
59. ltiple accelerator regions In this instance only time spent in Initialization in the Region and in the Kernel can be accurately computed so other values are not displayed in the Accelerator Performance tab Accelerator Kernel Timing Information Time is reported in seconds When you select a non zero Accelerator Kernel Timing item you see the following information in the Accelerator Performance tab gt gt Kernel Execution Count execution count for the selected kernel Grid Size the size in 1D X or 2D XxY of the grid used to execute blocks of threads for the selected kernel Block Size the size in 1D X 2D XxY or 3D XxYxZ of the thread blocks for the selected kernel Maximum time spent in accelerator kernel the maximum time spent in a single execution of selected kernel Minimum time spent in accelerator kernel the minimum time spent in a single execution of selected kernel Average time spent in accelerator kernel the average time spent per execution of selected kernel When there are multiple invocations of the same kernel in which the grid size and or block size changes the size information displayed in the Accelerator Performance tab is expressed as a range For example if the same kernel could be executed with a 2D block of size 2 64 and a 2D block of size 4 32 then the size displayed in Accelerator Performance tab is the range 2 4 32 64 CUDA Fortran Profiles Profiles generated by
60. me secs 0 011285 4 0 024057 0 023942 0 024001 Accelerator Performance Compiler Feedback Profiled mm 2 on Tue Mar 29 12 34 07 PDT 2011 Profile pgprof out Parallelism Histogram System Configuration Figure 13 Source Level Profiling for an Accelerator Region Kernel Level Analysis Since an accelerator region can contain multiple distinct kernels you may want to examine performance data for an individual kernel You do this by selecting the first source line of the kernel In the source level view the first line of a kernel has data listed in the Accelerator Kernel Time column To navigate to the longest executing kernel 1 Select Accelerator Kernel Time in the HotSpot selector in the upper right portion of the user interface 2 Click the double left arrow lt lt located next to the HotSpot selector In Figure 14 the selected line in the main Statistics Table has a value only in the Accelerator Kernel Time The Accelerator Performance tab displays all the details for the Accelerator Kernel performance data PGI Profiler User Guide 25 Using PGPROF ES pgprof balak File Edit View Sort Help Bas Find D dh HotSpot Seconds D YX R rofout mml 8 Source Max Seconds Max Accelerator Region Time Max Accelerator Kernel Til subroutine mmi a b c m 0000 0000 0 0000 real dimension a b c 0000 0000 0 0000 1 acc region 0000 2207 BEI 100 0000
61. ms Some methods of performance data collection work better with multi threaded programs than others As always the recommended approach is to use pgcollect initially with time based sampling optionally followed by event based sampling Building with Minfo ccf f is always a good idea when using pgcollect Alternatively building with the compiler option Mprof 1ines creates a program that collects accurate multi threaded performance profiles The Mprof func option works with multi threaded programs Routines that contain one or more parallel regions appear in a profile as if they were run on a single thread because the data collection is at the entry and exit of the routine when the parallelism is not active The Mprof time and pg options generate programs that only collect data on a single thread To collect data for programs built using Mprof run your program normally Upon successful termination a pgprof out file is created Analyzing the Performance of Multi Threaded Programs The display of profile data for a multi threaded program differs from that of a single threaded program in a couple of ways gt In the Statistics Table the data shown is the maximum value for any single thread in the process gt The Parallelism tab shows the thread specific performance data for the row selected in the Statistics Table whether the Statistics Table is in the routine level line level or assembly level view Click the arrow icon
62. nt When using pgcollect for event based profiling you control the OProfile kernel driver and the sample collection daemon via the OProfile command opcontrol This control requires root privileges for management operations Thus invocations to opcontrol performed by pgcollect are executed via the sudo command When using pgcollect you control the OProfile kernel driver and the sample collection daemon via the OProfile command opcontro1 This control requires root privileges for management operations Thus invocations to opcontrol which are performed when pgcollect is used are executed via the sudo command One technique that requires minimal updates to the etc sudoers files is to assume that all users in a group are allowed to execute opcontrol with group privileges For example you could make the following changes to etc sudoers to permit all members of the group sw to run opcontrol with root privileges User alias specification User Alias SW sw SW ALL NOPASSWD usr bin opcontrol PGI Profiler User Guide 58 pgcollect Reference 9 6 2 Interrupted Profile Runs pgcollect shuts down the OProfile daemon when interrupted However if the script is terminated with SIGKILL you must execute the following pgcollect shutdown Executing this command is important because if the OProfile daemon is left running disk space on the root file system eventually is exhausted 9 6 3 Event based Profiling Options check events D
63. nt bytes recv visits file s tat no min no avg no max no proc no thread Set which process fields to display or not to display when noJall using the arguments beginning with no th read Specify a thread for a multi threaded process profile t imes raw pct Specify whether time related values should be displayed as raw numbers or as percentages The default is pct history num num string Repeat recent commands 8 3 Command Reference This section provides more details about the commands in the previous Command Summary Table asm a sm routine gt filename Display the instruction and line level data together with the source and assembly for the specified routine If the filename argument is present the output is placed in the named file The gt means redirect output and is optional This command is only available on platforms that support assembly level profiling ccff c cff file function line number PGI Profiler User Guide 52 Command Line Interface Display compiler feedback for the specified file function or source line PGI compilers can produce information in the Common Compiler Feedback Format CCFF that provides details about the compiler s analysis and optimization of your program Often this information can illuminate ways in which to further optimize a program The CCFF information is produced by default when using the Mprof compiler option but if you are
64. ntax is Mprof option option You use Mprof to set performance profiling options Use of these options causes the resulting executable to create a performance profile that can be viewed and analyzed with the PGPROF performance profiler e If you use pgcollect to gather performance data you do not need to compile or link with Mp ro f 3 2 Profiling Compilation Options In the descriptions that follow instrumentation based profiling implies compiler generated source instrumentation profiling implies the use of instrumented wrappers for MPI library routines Minfo ccff Generate compiler feedback information and store it in object and executable files for later access by performance tools Use Minfo ccff when collecting performance data using pgcollect All Mprof options except Mprof dwarf imply Minfo ccff Mprof dwarf Generate a subset of DWARF symbol information adequate for viewing source line information with most performance profilers In the PGI compilers Mprof dwarf is on by default You can use the Mnodwarf option to disable it Source level information is not available if you profile a program built with Mnodwarf PGI Profiler User Guide 29 Compiler Options for Profiling Mprof func Perform routine level instrumentation based profiling Mprof lines Perform instrumentation based line level profiling Mprof mpich Use the default MPICH v3 libraries on Linux and OS X for profiling Implies
65. o the table in the Accelerator Performance tab contains details about that event The information displayed depends on the selection If a user selects a line in which both events are non zero then the Accelerator Performance tab displays only Accelerator Initialization Time Accelerator Region Time and Accelerator Kernel Time Accelerator Region Timing Information Time is reported in seconds When you select a non zero Accelerator Region Timing item you see the following information in the Accelerator Performance tab Accelerator Initialization Time time spent in accelerator initialization for the selected region gt Accelerator Kernel Time time spent in compute kernel s for the selected region Data Transfer Time time spent in data transfer between host and accelerator memory gt Accelerator Execution Count execution count for the selected region gt Maximum time spent in accelerator region w o init the maximum time spent in a single execution of selected region PGI Profiler User Guide 48 PGPROF Reference Minimum time spent in accelerator region w o init the minimum time spent in a single execution of selected region Average time spent in accelerator region w o init the average time spent per execution of selected region The table does not contain values that are not relevant such as zero values or values that cannot be computed For example in a routine level profile a routine can execute mu
66. o not execute a profiling run just check the event settings specified on the command line exe lt exename gt Specify the program to be profiled You only need to use exe when the program argument is a script that invokes the program list events List profiling events supported by the system shutdown Shut down the profiling interface You only need to use this option in rare cases when a profiling run was interrupted and OProfile was not shut down properly Predefined Performance Data Collection Options allcache Profile instruction data and branch cache misses dcache Profile various sources of data cache misses imisses Profile instruction cache related misses hwtime lt millisecs gt Provide time based sampling only Specify the sampling interval in milliseconds User Defined Performance Data Collection Options es function lt name gt Set profile events via a shell function event lt spec gt Manually add an event profile specification An event profile specification is an opcontrol event argument that is the event profile specification provided on the command line is appended to event and passed as an argument to opcontrol post function lt name gt Execute a shell function after profiling is complete PGI Profiler User Guide 59 pgcollect Reference 9 6 4 Defining Custom Event Specifications Thepgcollect event EVENTSPEC options are accumulated and used to specify events to be measured For more
67. or the Profiler refer to Command Line Options For sample launch commands refer to Profiler Invocation and Startup 1 6 Application Tuning So how do you make your program run faster The process of tuning your program ranges from simple to complex gt In the simple case you may be able to easily tune the application and improve performance dramatically simply by adding a compiler option when you build The Compiler Feedback and System Configuration tabs in the PGPROF user interface contain information that can help identify these situations gt Ina slightly more challenging scenario you may need to restructure part of your code to allow the compiler to optimize it more effectively For instance the Compiler Feedback for a given loop may provide a hint to remove a call from the loop If the call can be moved out of the loop or inlined the loop might be vectorized by the next compile PGI Profiler User Guide 5 Getting Started gt More difficult cases involve memory alignment and algorithm restructuring These issues are beyond the scope of this manual 1 7 Troubleshooting If you are having trouble during invocation or the initialization process use the following sections for tips on what might be causing your problem 1 7 1 Prerequisite Java Virtual Machine PGPROF depends on the Java Virtual Machine JVM which is part of the Java Runtime Environment JRE PGPROF requires that the JRE be version 1 6 or above Linux
68. os OS X When PGI software is installed on Linux or OS X the version of Java required by the profiler is also installed PGPROF uses this version of Java by default You can override this behavior in two ways set your PATH to include a different version of Java or set the PGI_JAVA environment variable to the full path of the Java executable The following example uses a bash command to set PGI JAVA S export PGI JAVA home myuser myjava bin java Windows If an appropriately versioned JRE is not already on your system the PGI software installation process installs it for you The PGI command shell and Start menu links are automatically configured to use the JRE If you choose to skip the JRE installation step or want to use a different version of Java to run the profiler then set your PATH to include the Java bin directory or use the PGI JAVA environment variable to specify the full path to the java executable 1 7 2 Slow Network If you are viewing a profile across a slow network connection or a connection that does not support remote display of Java GUIs consider using the PGPROF command line interface described in Command Line Interface PGI Profiler User Guide 6 Chapter 2 USING PGPROF In Getting Started you learned how to choose a profiling method build your program and execute it to collect profile data This section provides a more detailed description of how to use the features of PGPROF in particular Profile nav
69. ounter measures use pgcollect list events For more detailed information see the processor vendor s documentation Figure 11 shows a profile of four event counters CPU_CLK_UNHALTED DATA CACHE MISSES DATA CACHE REFILLS _FROM_L2 DATA CACHE REFILLS FROM SYSTEM In this example the routine using the most time is also getting many cache misses Investigating the memory access behavior in that routine and looking at the Compiler Feedback may offer some clues for improving its performance PGI Profiler User Guide 21 Using PGPROF pgprof 4T out 0 pgprof dla File Edit View Sort Help ba e Find lr 9 HotSpot CPU_CLK_UNHALTED amp 9 Function Max CPU_CLK_UNHALTED Max DATA_CACHE_MISSES Max DATA_CACHE_REFILLS_FROM_L2_OR_ mp barrierp mm fv update nonbon a m serial mp barrier tether f nonbon mp barrierw f torsion read eval do eval strcmp v nonbon Y coo oo OO 3 167 120 000 MN 1 483 920 000 Il 1 186 800 000 ll 899 200 000 768 320 000 183 840 000 150 320 000 84 720 000 76 880 000 76 160 000 70 080 000 53 040 000 109 336 000 Bl 104 984 000 Il 616 000 195 816 000 MN 21 400 000 12 616 000 7 392 000 4 816 000 3 376 000 3 360 000 1 008 000 1 240 000 2 728 000 90 176 000 MN 3 536 000 125 856 000 MN 45 264 000 E 10 560 000 504 000 2 984 000 2 992 000 2 760 000 376 000 736 000 la Pro
70. pecified by the environment variable HOME E The name of the file depends on your OS gt On Windows the name of the file is mypgi rc On Linux and OS X the name of the file is mypgirc In this file you put a line that indicates compute capability 1 3 or 2 0 COMPUTECAP 13 ue COMPUTECAP 2 0 Placing this line in this file also affects the compiler defaults with respect to compute capability Performance Profiling with User defined Counter Configurations You have the ability to specify which counters to use in data collection To do this you create a profile configuration file with any filename You can do this using this command pgcollect cuda list To specify the counters to use place a list of NVIDIA counters in your file listing one counter per line In general the number of counters you can list is limited to four although with compute capability 2 0 you may be able to use more depending on the counters selected In addition you may always list certain data collection filepaths that do not depend on hardware counters such as these gridsize stasmemperblock threadblocksize regperthread dynsmemperblock memtransfersize PGI Profiler User Guide 27 To get a full list of the counters available use this command pgcollect cuda list Using PGPROF E pgprof File Edit View Sort Help ais pgprof out Find d HotSpot Seconds Function Max Seconds
71. pgcollect for CUDA Fortran programs capture data from GPU performance counters The specific counters available for a given GPU depend on the GPU s compute capability In Figure 20 gt The columns labeled Max CUDA GPU Secs and Max CUDA CPU Secs show times captured by the CUDA driver gt The Max Seconds column contains timings for host only code gt Pseudo function names Data Transfer To Host and Data Transfer To Device show the transfer times to and from the GPU PGI Profiler User Guide 49 PGPROF Reference gt The Accelerator Performance Tab shows counter values collected from the GPU E pgprof SS A O E File Edit View Sort Help pgprof out Function Max Seconds v Max CUDA GPU Secs Max CUDA CPU Secs sstk System Time open_nocancel mmul_drv read nocancel Data Transfer to Host mmul kernel Data Transfer to Device 0 3077 MEN 0 0385 0 0192 0 0192 0 0192 0 0000 0 0000 0 0000 76 10 5 5 5 0 0 0 0 0000 0000 0000 0000 0017 0126 MN 0048 li 0000 0 0 0 0 0 9 66 25 0000 0000 0000 0000 0000 0050 Ml 0145 MN 0053 fl oooooooo Device Number 0 CUDA Kernel Call Count 100 Grid Size Thread Block Size Avg Regs Thread 14 Avg Occupancy 1 000000 Coalesced Loads 163 840 Coalesced Stores 40 960 Avg Static Shmem Block 210 400 8x8 16x16x1 Sorted By Seconds
72. profiling with the pgcollect command you must build your program with the Minfo ccff compiler option to produce this information display d isplay display options all none Specify display information This includes information on minimum values maximum values average values or per processor thread data Below is a list of possible display options no calls no cover no time no timecall no cost no proc no thread no msgs no msgs_sent no msgs recv no bytes no bytes sent no name no file no line no lineno no visits no scale no stmtno help he lp command Provide brief command synopsis If the command argument is present only information for that command is displayed The character may be used as an alias for help history h istory size Display the history list which stores previous commands in a manner similar to that available with csh or dbx The optional size argument specifies the number of lines to store in the history list lines l ines function gt filename Display the line level data together with the source for the specified function If the filename argument is present the output is placed in the named file The gt means redirect output and is optional load lo ad datafile Load a new dataset With no arguments reloads the current dataset A single argument is interpreted as a new data file With two arguments the first is interpreted as the program and the second
73. pt are either the filename of the program to be profiled or the name of a script that invokes the program When applicable you can provide arguments for the specified program or script program args or program or script args The following sections describe the pgcollect command line options in more detail 9 3 Build for pgcollect If your program was built with PGI compilers you do not need to use any special options to use pgcollect However if your programs are built using the Minfo ccff option then PGPROF can correlate compiler feedback and optimization hints with the source code and performance data If you built your program using a non PGI compiler consider building with debugging information so you can view source level performance data Be aware however that building with debugging information may change the performance of your program 9 4 General Options This section describes options that apply to all forms of the pgcollect command For options specific to controlling time based or event based profiling refer to Time Based Profiling and Event Based Profiling respectively V Display the version of pgcollect being run help Show pgcollect usage and switches 9 5 Time Based Profiling Time based profiling runs the program using time based sampling This form of pgcollect uses operating system facilities for sampling the program counter at 10 millisecond intervals PGI Profiler User Guide 57 pgcollect Referen
74. put for the program The syntax for this command filepath is cuda gmem branch cfg lt cfgpath gt ccl3 cc20 list The sub filepaths modify the behavior of pgcollect cuda as described here branch Collect branching and warp statistics cc13 Use counters for compute capability 1 3 default cenm Use counters for compute capability n m Usepgcollect help to see which compute capabilities your system supports cfg lt cfgpath gt Specify lt cfgpath gt as CUDA profile config file gmem Collect global memory access statistics PGI Profiler User Guide 26 Using PGPROF list List cuda event names available for use in profile config file Performance Profiling with Pre defined Counter Configurations The gmem and branch sub filepaths initiate profiling with predefined sets of performance counters to measure specific areas of GPU resource utilization gt gmem measures accesses to global memory gt branch tracks divergent branches and thread warp statistics Some of the counters used for gmem and branch differ depending on the version compute capability of the GPU you are using To ensure that you use the counters available on your GPU you must specify the compute capability you want to use You can do this in two ways gt On the pgcollect command line For example to specify compute capability 1 3 you can use pgcollect cuda branch cc13 myprog gt Ina special file in your home directory The home directory is s
75. rence Accelerator Region An accelerator region is a region of code that has been executed on the accelerator device An accelerator region might transfer data between the host and the accelerator device Further an accelerator region can be split into several accelerator kernels Accelerator Kernel An accelerator kernel is a compute intensive highly parallel portion of code executed on an accelerator device Each compiler generated kernel is code executed by a block of threads mapped into a grid of blocks Figure 19 illustrates one possible display for the Accelerator Performance tab one that is relative to the Accelerator Kernel Device Number 0 Accelerator Kernel Execution Count 4 Grid Size 63x63 Block Size 16x16 Maximum time spent in Accelerator Kernel secs 0 000216 Minimum time spent in Accelerator Kernel secs 0 000206 Average time spent in Accelerator Kernel secs 0 000209 Parallelism Histogram Compiler Feedback System Configuration Accelerator Performance Profiled mm2 on Tue Mar 29 12 34 07 PDT 2011 Profile pgprof out Figure 19 Accelerator Performance tab of Focus Panel PGPROF displays two Accelerator events in the Statistic table gt Accelerator Region Time the time in seconds spent in the Accelerator region gt Accelerator Kernel Time the time in seconds spent in the Accelerator kernel When a user selects a line for which one of these events is non zer
76. rformance data always introduces some overhead or intrusion that can affect the behavior of the application being monitored How this overhead affects the accuracy of the performance data depends on the performance monitoring method chosen system software and hardware attributes the load on the system during data collection and the idiosyncrasies of the profiled application Although the PGPROF implementation attempts to minimize intrusion and maximize accuracy it would be unwise to assume the data is beyond question 6 3 2 Clock Granularity Many target machines provide a clock resolution of only 20 to 100 ticks per second Under these circumstances a routine must consume at least a few seconds of CPU time to generate meaningful line level times PGI Profiler User Guide 36 Data and Precision 6 3 3 Source Code Correlation At higher optimization levels and especially with highly vectorized code significant code reorganization may occur within functions The PGPROF profiler allows line profiling at any optimization level In some cases the correlation between source and data may at times appear inconsistent Compiling at a lower optimization level or examining the assembly language source may help you interpret the data in these cases PGI Profiler User Guide 37 Chapter 7 PGPROF REFERENCE This section provides a reference guide to the features of the PGPROF performance profiler For information about how to invoke PGPROF r
77. rinter but is formatted as an editable text file After selecting this menu item a Save File dialog box appears Enter or choose an output file in the dialog box Click Cancel to abort the print operation gt Close Select this option to close the current profiling session This option is enabled only when more than one profile is open gt Exit Select this option to end the profiling session and exit the profiler 7 2 2 Edit Menu Use the Edit menu to launch a text search in the Statistics Table and to restore revert or save user preference settings This menu contains the following items gt Search Forward Displays a dialog box that prompts for the text to be located Once the text is entered and the OK button selected PGPROF searches forward to the next occurrence of the text in the function list source code or assembly code displayed in the Statistics Table Matching text is displayed in red A search can also be invoked using the Find text box on the main toolbar gt Search Backward Displays a dialog box that prompts for the text to be located Once the text is entered and the OK button selected PGPROF searches backward to the previous occurrence of the text in the function list source code or assembly code displayed in the Statistics Table Matching text is displayed in red gt Search Again Use this option to repeat the last search gt Clear Search Use this option to clear the search and turn
78. rmation is listed by process Each process can be expanded to reveal profiling information by each thread in that process To expand a process into its threads click on the gt icon on the left of the P icon 7 5 2 Histogram tab This tab displays a histogram of one or more profiled data items Each bar graph corresponds to one of the performance measurements gt Fach vertical bar corresponds to a profile entry that is performance data associated with a program location gt The bars are sorted in the order specified in the Sort menu described in Sort Menu and the current sorting metric is labeled in the lower right hand corner of the table itself gt Clicking on a bar displays information for the corresponding profile item in the Statistics Table gt Double clicking on a bar drills down into the profile for the portion of the program corresponding to the bar gt Selected bars are highlighted in yellow 7 5 3 Compiler Feedback tab This tab displays information provided by the compiler regarding the characteristics of a selected piece of the program including optimization and parallelization information obstacles to optimization or parallelization and hints about how to improve the performance of that portion of the code Such information is available at the line level and the routine level If Compiler Feedback information is available round blue buttons containing a lower case 1 are displayed on the lef
79. rrelates execution time to code showing the amount of time spent in each routine each source line and each assembly instruction in the program For more information on time based profiling refer to Time Based Profiling Event based sampling Supported only on linux86 64 systems creates an event based profile that correlates hardware events to program source code In this method pgcollect uses hardware event counters supported by the processor to gather resource utilization data such as cache misses e This method requires co installation of the open source performance tool OProfile For more information on event based profiles refer to Event Based Profiling Both forms of the pgcollect command gather performance data that can be correlated to individual threads including OpenMP threads as well as to shared objects dynamic libraries and DILs For current availability of pgcollect and pgcollect features on a given platform refer to the PGI Release Notes PGI Profiler User Guide 56 pgcollect Reference 9 2 Invoke pgcollect The command you use to invoke pgcollect depends on the type of profile you wish to create Use the following command to invoke pgcollect for time based sampling pgcollect time program program args Use the following command to invoke pgcollect for event based sampling available on Linux86 64 pgcollect event options gt program or script program or script args programor program or scri
80. s data parallel extensions based on the High Performance Fortran HPF defacto standard The PGI Fortran Reference Manual describes Fortran statements and extensions as implemented in the PGI Fortran compilers PGPROF permits profiling of serial and parallel multi threaded OpenMP and or MPI programs compiled with PGI compilers PGI Profiler User Guide vii Preface For further information refer to the following gt American National Standard Programming Language FORTRAN ANSI X3 1978 1978 gt ISO IEC 1539 1991 Information technology Programming Languages Fortran Geneva 1991 Fortran 90 gt ISO IEC 1539 1997 Information technology Programming Languages Fortran Geneva 1997 Fortran 95 gt High Performance Fortran Language Specification Revision 1 0 Rice University Houston Texas 1993 http www crpc rice edu HPFF gt High Performance Fortran Language Specification Revision 2 0 Rice University Houston Texas 1997 http www crpc rice edu HPFF gt OpenMP Application Program Interface Version 2 5 May 2005 http www openmp org gt Programming in VAX Fortran Version 4 0 Digital Equipment Corporation September 1984 gt IBM VS Fortran IBM Corporation Rev GC26 4119 gt Military Standard Fortran DOD Supplement to American National Standard Programming Language Fortran ANSI x 3 1978 MIL STD 1753 November 9 1978 gt American National Standard Programming Language C AN
81. s the Compiler Feedback tab in the Focus Panel Messages are categorized according to the type of information that they contain For more information on the Compiler Feedback tab refer to Compiler Feedback tab PGI Profiler User Guide 14 Using PGPROF For more information on the Common Compiler Feedback Format CCFF refer to the website http www pgroup com ccff 2 5 1 Special Feedback Messages There are some Compiler Feedback messages that deserve some explanation specifically intensity messages and messages for inlined routines Intensity Messages Computational intensity has been defined as the number of arithmetic operations performed per memory transfer R W Hockney and C R Jesshope Parallel Computers 2 Architecture Programming and Algorithms 1988 The key idea is this a high compute intensity value means that the time spent on data transfer is low compared to the time spent on arithmetic a low compute intensity value suggests that memory traffic involving data transfer may dominate the overall time used by the computer The PGI Compiler emphasizes floating point operations if they are present to calculate the compute intensity ratio within a particular loop If floating point operations are not present the PGI compiler uses integer operations In some cases it is necessary to build programs using profile guided optimization by building with Mpfi or with Mpfo as described in the section Profile Feedback Optim
82. s the number of bytes sent CALLS The number of times a function is called COST The sum of the differences between the timer value entering and exiting a function This includes time spent on behalf of the current function in all children whether profiled or not PGPROF can provide cost information when you compile your program with either the Mprof cost or the Mprof lines option For more information refer to Basic Profiling COUNT The number of times a line or function is executed LINE NUMBER For line mode this is the line number for that line For function mode this is the line number of the first line of the function PGPROF sometimes generates multiple statements for a single source line thus multiple profiling entries might appear for a single source line To distinguish them PGPROF uses the notation lineNo statementNo MESSAGES For MPI profiles only This is the number of messages sent and received by the function or line RECEIVES For MPI profiles only This is the number of messages received by the function or line SENDS For MPI profiles only This is the number of messages sent by the function or line TIME The time spent only within the function or executing the line The TIME does not include time spent in functions called from this function or line TIME may be displayed in seconds or as a percent of the total time 6 3 Caveats Precision of Profiling Results 6 3 1 Accuracy of Performance Data The collection of pe
83. ser Guide 30 Chapter 4 COMMAND LINE OPTIONS This section describes the PGPROF command line options and how they are interpreted As we stated in Getting Started PGPROF can interpret command line options when present on the command line 4 1 Command Line Option Descriptions The following list describes the options and how PGPROF interprets them datafile A single datafile name may be specified on the command line For profiled MPI applications the specified datafile should be that of the initial MPI process Access to the profile data for all MPI processes is available in that case and data may be filtered to allow inspection of the data from a subset of the processes The default datafile name is pgprof out If no datafile argument is used PGPROF attempts to use pgprof out in the current directory exe filename Set the executable to filename The default filename is a out feedbackonly Linux only Only browse source code and Compiler Feedback information Do not load any performance data from profile runs help Prints a list of available command line arguments I lt srepath gt Specify the source file search path PGPROF always looks for a program source file in the current directory first If it does not find the source file in the current directory it consults the search path specified in srcpath The srcpath argument is a string containing one or more directories separated by a path separator The path
84. sful program termination one profile data file is created for each MPI process The master profile data file is named pgprof out The other files have names similar to pgprof out but they are numbered PGPROF MPI profiling collects counts of the number of messages and bytes sent and received You can then use this information to analyze a program s message passing behavior Analyzing the Performance of MPI Programs Figure 9 illustrates an MPI profile This sample shows an example MPI profile with maximum times and counts in the Statistics Table and per process measurements in the Parallelism tab The Parallelism tab for MPI programs is used in the same way that it is used for multi threaded programs as described in Analyzing the Performance of Multi Threaded Programs You can use the send and receive counts for messages the byte counts to identify potential communication bottlenecks and the process specific data to find load imbalances PGI Profiler User Guide 18 Using PGPROF PGPROF File Edit View Bed Fink Jr dh o Hotspot Time pgprof out Function Process Time v Count Messages Messages Sent D main Wax 0 0091 50 CR 100x i Df Max 0 0005 gt o 0 0 i checkd Max 0 3 ON 0 0 0 i rel err ok Max Q ox 0 O 0 Soned By Time S Process Thread Browser for application cpi Profile Time v Count Messages Messages Sent Messages Recei Bytes Bytes Sent Bes Received P9prof out i 2
85. sily view a variety of information quickly PGPROF Common Icons Table 1 provides a summary of the common icons you see in the statistics table during profile navigation PGI Profiler User Guide Using PGPROF Table 1 PGPROF Icon Summary Click this icon to Display the corresponding assembly code for this line Hide the corresponding assembly code for this line Close the tab on which it is displayed Click to expand Focus Panel item Click to hide Focus Panel item e Display the compiler feedback for this line 2 2 Profile Navigation When you first invoke PGPROF it displays top level profiling information in a non closeable tab as illustrated in Figure 2 This tab shows the Statistics Table containing a routine list in the Function column and performance data associated with each routine in the Seconds column This list is sorted by the Seconds value assuming there is such a value in the profile data By default PGI compilers include enough symbol information in executables to allow PGPROF to display performance data at the source line level as well as at the routine level However if you compiled with the option Mnodwarf or Mprof func or if you built your program using another compiler you may only be able to access the routine level view PGI Profiler User Guide 9 Using PGPROF n pgprof SO File Edit View Sort Help o a gl Find dr P Hotspot Seconds y y pgprof
86. t side of the Statistics Table To access the information click on one of these info buttons The information is separated into categories of information about these items gt Asource line gt Variables referenced inside a routine gt Routines referenced inside another routine gt Howa file was compiled Each category is represented by a wide bar that functions like a button Clicking the bar expands the display to show the information in that category If no information is available in a given category that category is not listed This information is only available if the program was compiled and also linked using either the Mprof or the Minfo ccff option In some cases it is necessary to build programs PGI Profiler User Guide 46 PGPROF Reference using profile guided optimization by building with Mpfi or Mpfo as described in the section Profile Feedback Optimization using Mpfi Mpfo in the Optimizing and Parallelizing section of the PGI Compiler User s Guide Profile guided optimization can often determine loop counts and other information needed to calculate the Compute Intensity for a given statement or loop 7 5 4 System Configuration tab This tab displays System and Accelerator tabs containing information about the system on which the profile run was executed System Tab Can include information such as process es process manufacturer processor model processor the program s OS target cores per socket total cor
87. t variable to specify hardware event counters from which to collect data This variable is applied to programs executed with the pgcollect command using one of the event based profiling options The use of hardware event counters is discussed in further detail in Profiling Resource Utilization with Hardware Event Counters PGPROF NAME Use this environment variable to change the name of the output file intended for PGPROF The default is pgprof out This option is only applied to programs compiled with any of the following options Mprof func lines MPI time If a program is compiled with the pg option then the output file is always called gmon out PGI Profiler User Guide 34 Chapter 6 DATA AND PRECISION This section contains descriptions of the profiling mechanism that measures time how statistics are collected and the precision of the profiling results 6 1 Measuring Time The sample based profiling mechanism collects total CPU time for programs that are compiled with the options pg and Mprof time or executed with pgcollect time as described in Sample based Profiling The profiling mechanism collects cycle counts for programs run under the control of pgcollect or executed with pgcollect event based sampling PGPROF automatically converts CPU cycles into CPU time Programs compiled for instrumentation based profiling with Mprof lines or Mprof func employ a virtual timer for measuring the elapsed time
88. the PGI software on the Linux system OProfile is a performance profiling utility for Linux systems It runs in the background collecting information at a low overhead and providing profiles of code based on processor hardware events When installed pgcollect collects this type of performance data for analysis with PGPROF For more information on OProfile refer to http oprofile sourceforge net PGI Profiler User Guide 3 Getting Started Run your program using the pgcollect command for event based sampling with OProfile e MPI profiling is not available with pgco11ect profiling 1 3 Choose Profile Method Use the following guidelines to decide which performance data collection method to use gt A good starting point for any performance analysis is to use time based sampling with pgcollect as described in Basic Profiling If you want exact execution counts build with Mprof func or Mprof lines If you are profiling an MPI application on Linux build your application using Mprof time lt mpi gt where mpi is a supported MPI distribution for example MPICH You can also use an MPI wrapper such as mpicc or mpif90 with Mprof and one of the func lines or time suboptions If you use a wrapper from one of the PGI provided builds of MPI you do not need to modify the wrappers or config files to use them with Mprof If your MPI application also uses OpenMP or multiple threads per process and you want to determine wher
89. the color of all matching text back to black gt Restore Default Settings Use this option to restore the configuration of the user interface to the original default settings gt Revert to Saved Settings Use this option to restore the configuration of the GUI to the previously saved settings For more information refer to the See the Save Settings on Exit option PGI Profiler User Guide 40 PGPROF Reference gt Save Settings on Exit When this check box is selected PGPROF saves the current GUI configuration settings on exit These settings include the size of the main window position of the horizontal dividers the bar chart colors the selected font the tool tips preference and the options selected in the View menu When PGPROF is started again these saved settings are used To prevent saving these settings on exit clear this check box On Linux and Mac OS settings are saved on a per user basis On Windows settings are saved on a per user per system basis e You can also use the Find box in the toolbar to invoke the PGPROF search facility 7 2 3 View Menu Use the View menu to change the configuration of the PGPROF user interface This menu contains the following items gt Select Columns Invokes a dialog box that allows you to select which columns of the Statistics Table are to be displayed and how to display the data in the columns The choices for how to display the data are Value Percent Bar
90. to the left of the P to expand the view to show all threads PGI Profiler User Guide 16 Using PGPROF ii ba pgprof File Edit View Sort Help BEA Find d Hotspot CPU CLK UNHALTED pgprof 4T out 0 Max DATA CACHE MISSES 109 336 000 EN 104 984 000 EN 616 000 195 816 000 MN 21 400 000 12 616 000 7 392 000 4 816 000 3 376 000 3 360 000 1 008 000 1 240 000 968 000 Max DATA CACHE REFILLS FROM 2 728 000 90 176 000 MN 3 536 000 125 856 000 MN 45 264 000 ili 10 560 000 504 000 2 984 000 2 992 000 2 760 000 376 000 736 000 1 008 000 Function mp barrierp mm fv update nonbon a m serial mp barrier tether f nonbon mp barrierw f_torsion read_eval_do eval strcmp v_nonbon a next Max CPU CLK UNHALTED 3 167 120 000 MN 1 483 920 000 0 1 186 800 000 if 899 200 000 ff 768 320 000 183 840 000 150 320 000 84 720 000 76 880 000 76 160 000 70 080 000 53 040 000 45 040 000 tpac 39 520 000 1 152 000 1 144 000 re e 37 760 000 1 280 000 1 256 000 4 gt Sorted By CPU_CLK_UNHALTED OOOO COO OO OO Profile CPU CLK UNHALTED v DATA CACHE MISSES DATA CACHE REFILLS FRO DATA CACHE REFILLS FRO pgprof 4T out E 4 993 60 MM 100 4 993 60 MN 100 491 680 000 MN 100 491 680 000 MN 100 S Process Thread Viewer for routine mm fv update nonbon 309 656 000
91. tuning process when using PGPROF PGPROF sometimes shows multiple rows in the Statistics Table for a single source line The line numbers for such lines are shown in the Statistics Table using the notation line statement There are several situations where this line numbering can occur gt When there is more than one statement in a source line as in a C C program where one line contains multiple statements separated by semicolons gt When the compiler generates multiple alternative implementations of a loop The compiler may create alternate versions to handle differences in the data and how it is stored in memory gt When there is a complicated or conditional loop setup For these cases it is generally safe to sum the times and counts of all the lines However take care particularly with call counts not to double count measurements 7 5 PGPROF Focus Panel The Focus Panel consists of a number of tabs that allow you to select more detailed views of your profile data PGI Profiler User Guide 45 PGPROF Reference Compiler Feedback System Configuration Parallelism Histogram Accelerator Performance Figure 18 Focus Panel Tabs 7 5 1 Parallelism tab This tab displays a table with detailed profile information organized by processes and threads Profile information for the entire application is labeled Profile while information for the currently selected routine is labeled Routine Info
92. uickest way to locate a hot spot By hot spot we mean a program location that has a high value for some performance measurement such as Time Count and so on To locate the hot spot select the desired performance measurement in the HotSpot drop down menu in the Toolbar then click on the Hottest button lt lt illustrated in Figure 6 to select the highest value for that measurement in the current view Dropdown List HotSpot one higher of Performance than where you are Measurements Hottest Next Lower HotSpot HotSpot HotSpot Time KE SS Figure 6 HotSpot Navigation Controls In addition to the HotSpot navigation controls on the toolbar illustrated in Figure 6 you can find the performance critical parts of your program using the Histogram tab which shows clickable bar graphs of the performance data plotted against the address range of the program To find a HotSpot using the Histogram click on the Histogram tab In the histogram for the measurement you are interested in click on the tallest bar The corresponding row in the Statistics Table will be selected 2 4 Sorting Profile Data PGPROF maintains a consistent sort order for the Statistics Table and the Histogram tab Changing the sort order for either of these changes it for both of them The sort order can be changed by using the Sort Menu as described in Sort Menu or by clicking the column header in the Statistics Table or the row header in the Histogram tab
93. v Xebx Xedi 0 000 0 510 Dx404A40 89 C6 mov eax esi 0 000 0 510 Px404842 4C 89 DA movq r11 rdx 0 000 0 510 Dx404AA5 C1 E73 shi 0x3 edi 0 000 0 510 Dx404AAB 4C 89 D1 movq rl10 rcx 0 041 0 510 Dx404AAB 48 63 FF movs1q edi rdi 0 000 0 510 Dx404AAE 66 90 nop 0 000 0 511 if inverse then 0 000 0 512 wk conjg wcn 0 000 0 Sorted By Line S Process Thread Browser for routine Tt Routine Seconds 16 653 MN 16 653 MN S Process Thread Viewer for line 501 Seconds 501 gt Po Parallelism Histogram Compiler Feedback System Configuration Accelerator Performance Profiled fftpde on Mon Nov 01 17 46 50 PDT 2010 Profile pgprof out k 4 Figure 4 Assembly Level View gt To return to a previous view use the Back button lt in the Toolbar just below the Menus The Back and Forward buttons work much like those found in Back Forward web browsers moving to previous and next views respectively bd KI Down Arrows Figure 5 View Navigation Buttons gt To select and jump to a specific view use the down arrow on each of the Forward and Back buttons You can have multiple function views open at a time as illustrated in Figure 4 where tabs for both functions ff t and cf f t3 are displayed PGI Profiler User Guide 12 Using PGPROF 2 3 HotSpot Navigation The HotSpot navigation controls in the Toolbar are usually the q
94. version of this manual To find a help topic use one of the tabs in the left panel The book tab presents a table of contents gt The index tab presents an index of commands gt The magnifying glass tab presents a search engine Each help page displayed on the right may contain hyperlinks denoted in underlined blue to terms referenced elsewhere in the help engine Use the arrow buttons to navigate between visited pages Use the printer buttons to print the current help page About PGPROF This option opens a dialog box with version and contact information for PGPROF PGI Profiler User Guide 42 PGPROF Reference 7 3 PGPROF Toolbar As illustrated in the following figure the PGPROF toolbar provides navigation shortcuts and controls for frequently performed operations Dropdown List HotSpot one higher Scalability of Performance than where you are Analysis Forward Find Entry Measurements Box Hottest Next Lower Print Back Search Search HotSpot HotSpot wa Cader BS lt 9 gt Fina Y Hotspot Time ME REA M Ss Cc Profile Search Controls HotSpot controls Figure 17 PGPROF Toolbar The toolbar includes these buttons and controls gt Open Profile button clicking this button is the same as selecting File Open Profile from the menu bar Print button clicking this button is the same as selecting File Print from the menu bar gt Scalability Analysis button clicking th
95. vides more details about each command Table 3 PGPROF Commands NET Arguments Usage routine gt filename Display the instruction and line level data together with the source and assembly for the specified routine c cff file function line numb Display compiler feedback for the specified file function or source line d isplay display options all none Specify display information PGI Profiler User Guide 51 Command Line Interface NET Arguments Usage he lp Provide brief command synopsis hfistory size Display the history list which stores previous commands in a manner similar to that available with csh or dbx fines function gt filename Display the line level data together with the source for the specified function datafile Load a new dataset With no arguments reloads the current dataset m erge datafile Merge the profile data from the named datafile into the current loaded dataset pro cess processor num For multi process profiles specify the processor number of the data to display pfrint P filename Display the currently selected function data qfuit Exit the profiler select calls timecall time cost cover all gt Display data for a selected subset of the functions cutoff so rt by max avg min proc thread calls Function level data is displayed as a sorted list cover timecall time cost name msgs msgs sent msgs recv bytes bytes se
96. wever NVIDIA Corporation assumes no responsibility for the consequences of use of such information or for any infringement of patents or other rights of third parties that may result from its use No license is granted by implication of otherwise under any patent rights of NVIDIA Corporation Specifications mentioned in this publication are subject to change without notice This publication supersedes and replaces all other information previously supplied NVIDIA Corporation products are not authorized as critical components in life support devices or systems without express written approval of NVIDIA Corporation Trademarks PGI Workstation PGI Server PGI Accelerator PGF95 PGF90 PGFORTRAN and PGI Unified Binary are trademarks and PGI PGHPF PGF77 PGCC PGC PGI Visual Fortran PVF PGI CDK Cluster Development Kit PGPROF PGDBG and The Portland Group are registered trademarks of NVIDIA Corporation in the U S and other countries Other company and product names may be trademarks of the respective companies with which they are associated Copyright 2013 2014 NVIDIA Corporation All rights reserved PGI

Download Pdf Manuals

image

Related Search

Related Contents

RCA EZ205 Camcorder User Manual  code engineering sets  Guia Rápido DM Series  Zanussi ZRG16602WE refrigerator  

Copyright © All rights reserved.
Failed to retrieve file