4.7 Performance Properties Report

Example:

  Property P00001 'ImbalanceInParallelLoop' holds for 'LOOP muldoe.F (68-102)',
  with a severity (in percent) of 0.1991

This section reports so-called "performance properties" that are detected automatically for the application. Performance properties capture common situations of inefficient execution; they are based on the profiling data that is reported for each region.

Properties have a name (ImbalanceInParallelLoop) and a context for which they have been detected (LOOP muldoe.F (68-102)). Each property carries a severity value, which represents the negative impact on overall performance that the property exhibits. The severity value is given as a percentage of total accumulated execution time (over all threads), and the properties are reported with decreasing severity values.

The code that checks for the ImbalanceInParallelLoop property sums the exitBarT values over all threads, and the severity is then given by the ratio of this sum to the total execution time of the application. Hence it detects situations with imbalanced (with respect to threads) amounts of work. Here is a list of all properties defined for ompP:

WaitAtBarrier: Th
If hardware counter events are selected for monitoring (see Sect. 3.2.5), the hardware counter data appears in the form of additional columns in the profiling report:

  R00018 zcopy.F (72-76) LOOP
   TID      execT   PAPI_L2_DCM   PAPI_TOT_INS
     0       0.06            82     15,736,041
     1       0.06            74     15,736,049
     2       0.06            93     15,736,074
     3       0.06           108     15,736,080
   SUM       0.25           357     62,944,244
  (other columns skipped)

The counters are displayed in the order in which they are specified (OMPP_CTR1 to OMPP_CTR4, for example); unset counters are skipped. Note that for easier human interpretation the numbers are grouped by thousands, millions, etc. This grouping is not performed in the CSV (comma-separated values) output format (see Sect. 3.2.2).

ompP always collects counter data for the sub-region where the actual work is performed, i.e., the region that contains user code. Some constructs do not actually contain user code, such as the barrier construct. For other regions it would not make sense to acquire counters, for performance reasons; an example of this is the atomic construct. The table above marks the sub-region for which hardware counter data is reported with a 'C'.

4.5 Callgraph Region Profiles

  [*00] 310.wupwise
  [+01] R00016 zcopy.F (46-78) PARALLEL
  [=02] R00017 zcopy.F (53-57) LOOP
   TID      execT   execC   bodyT/I   bodyT/E   exitBarT
     0       0.54       9      0.53      0.53       0.02
     1       0.54       9      0.53      0.53       0.01
     2       0.54       9      0.54      0.54       0.01
     3       0.54       9      0.53      0.53       0.01
   SUM       2.17      36      2.12      2.12       0.05

The data displayed in this secti
1 Introduction

ompP is a profiling tool for OpenMP applications written in C/C++ or FORTRAN. ompP's profiling report becomes available immediately after program termination in a human-readable format. ompP supports the measurement of hardware performance counters using PAPI [1] and supports several advanced productivity features such as overhead analysis and detection of common inefficiency situations (performance properties).

2 Installation

Please see the file "INSTALL" for detailed instructions on how to install ompP on your system.

3 Usage

3.1 Instrumenting and Linking Applications with ompP

ompP is implemented as a static library that is linked to your application. To capture OpenMP execution events, ompP relies on Opari [3] for source-to-source instrumentation. A helper script kinst-ompp (or kinst-ompp-papi) is included that hides the details of invoking Opari from the user. To instrument your application with Opari and link it with ompP's monitoring library, simply prefix any compile or link command with kinst-ompp or kinst-ompp-papi. I.e., on a shell prompt:

  $> icc -openmp foo.c bar.c -o myapp

becomes

  $> kinst-ompp icc -openmp foo.c bar.c -o myapp

Similarly, to use ompP with Makefiles, simply replace the compiler specification like CC=icc with CC="kinst-ompp icc".

(These scripts are based on similar scripts from the SCALASCA and KOJAK packages, courtesy of Bernd Mohr, FZ Juelich.)

3.1.1 Instrumenting User-Defined Regions

U
both parallel regions and combined worksharing-parallel regions are included here. The list is sorted with decreasing wallclock execution time. The percentage gives the amount of total time spent in a particular parallel region.

The following two sections list overheads according to a classification scheme described in detail in [2]. The difference between these two sections is the order in which parallel regions are listed and the way in which percentages are computed; the absolute times reported are the same for both sections. In the "Overheads wrt. each individual parallel region" section, the parallel regions appear in the same order as in the "Parallel regions sorted by wallclock time" part, i.e., with decreasing overall execution time. The first column, Total, is the wallclock time times the number of threads used in the parallel region and hence corresponds to the overall total consumed execution time of the parallel region. The listed overheads are similarly summed over all threads. Four overhead categories are distinguished: synchronization, imbalance, limited parallelism, and thread management.

The Ovhds column is the sum of the four other overhead categories, and all percentages are computed with respect to each individual parallel region. That is, in the shown example, for R00004 the summed execution time over all threads is 16.02 seconds, and of this, 0.16 seconds (again summed over all threads) is spent in some overheads. This is
minor, revision).

The "PAPI Support" line indicates if ompP was built with PAPI support ("available") or not ("not available"). If PAPI support is present, the header also contains information about the used counters, if any. If no HW counters are used, the "PAPI Active" line will be "no"; otherwise it will be "yes" and the rest of the header will show how many and which counters are used. The "Max Counters" line lists the maximal number of counters supported by ompP. This is a compile-time constant and can be changed by adapting the definition of OMPP_PAPI_MAX_CTRS in file ompp.h of the source code distribution. For the usage of evaluators please see Sect. 3.2.6.

4.2 Region Overview

Example:

  PARALLEL: 7 regions:
   * R00016 zcopy.F (46-78)
   * R00007 muldoe.F (63-145)
   * R00004 muldeo.F (63-145)
   * R00013 zaxpy.F (48-81)
   * R00001 dznrm2.F (109-145)
   * R00019 zdotc.F (50-82)
   * R00022 zscal.F (42-69)

  PARALLEL LOOP: 3 regions:
   * R00010 rndcnf.F (48-52)
  ...

This section lists all OpenMP regions identified by ompP. An ompP region is the lexical extent of an OpenMP language construct. All OpenMP constructs are supported by ompP, and their accompanying regions are represented by region identifiers (e.g., R00016). The region overview lists the different region types in the report and gives their region identifiers with their source code location. For most ompP regions the source code location is given as a file name and two line numbers (begin and end). For
profiles for each OpenMP construct on a per-region basis. The profiles in this section are flat, i.e., the times and counts reported are those incurred for the particular construct, irrespective of how the construct was executed. For example, imagine a critical section c in a function foo(). If foo() is called from two different parallel regions r1 and r2, the flat profile for c will show summary data for both calls; i.e., it is not possible to distinguish between the calls from r1 or r2 in this view. The callgraph region profiles section of ompP's profiling report offers the ability to distinguish data based on the callgraph (see Sect. 4.3).

The first line of a flat region profile lists the region identifier, the source code location, and the region type. The next line is the header for the data entries; then data appears on a per-thread basis (the thread IDs correspond to those delivered by OpenMP's omp_get_thread_num() call).

The columns reported by ompP depend on the type of the OpenMP region. Timing entries end with a capital T (e.g., execT), while counts end with a capital C (e.g., execC). The table below lists which timing and count categories are reported for the various OpenMP constructs.
  R00004  PARALLEL  muldeo.F (63-145)    4.01 (24.45)
  R00007  PARALLEL  muldoe.F (63-145)    3.99 (24.34)
  R00013  PARALLEL  zaxpy.F (48-81)      0.70 ( 4.30)
  R00012  PARLOOP   uinith.F (77-114)    0.66 ( 4.03)
  R00016  PARALLEL  zcopy.F (46-78)      0.61 ( 3.71)

Overheads wrt. each individual parallel region:
            Total     Ovhds (%)      Synch (%)     Imbal (%)    Limpar (%)     Mgmt (%)
  R00004    16.02   0.16 ( 1.02)   0.00 (0.00)   0.16 (1.01)   0.00 (0.00)   0.00 (0.00)
  R00007    15.95   0.22 ( 1.40)   0.00 (0.00)   0.22 (1.39)   0.00 (0.00)   0.00 (0.00)

Overheads wrt. whole program:
            Total     Ovhds (%)      Synch (%)     Imbal (%)    Limpar (%)     Mgmt (%)
  R00007    15.95   0.22 ( 0.34)   0.00 (0.00)   0.22 (0.34)   0.00 (0.00)   0.00 (0.00)
  R00004    16.02   0.16 ( 0.25)   0.00 (0.00)   0.16 (0.25)   0.00 (0.00)   0.00 (0.00)

The overhead analysis report offers various interesting insights into where an application wastefully spends its time. The first part of the overhead analysis report shows the total runtime; this corresponds to the duration report of the overview section of the profiling report (Sect. 4.2). Then the total number of parallel regions is reported (this includes combined worksharing-parallel regions), and then the parallel coverage is listed. The parallel coverage is the amount of execution time spent in parallel regions; a low parallel coverage limits the speedup that can be achieved by parallel execution.

The next part of the overhead analysis report lists all parallel regions, sorted by their wallclock execution time. Again
The OpenMP Profiler ompP: User Guide and Manual

Version 0.7.0, March 2009

Karl Fuerlinger
Innovative Computing Laboratory
Department of Computer Science
University of Tennessee
karl(at)cs.utk.edu

Contents

1 Introduction
2 Installation
3 Usage
  3.1 Instrumenting and Linking Applications with ompP
    3.1.1 Instrumenting User-Defined Regions
    3.1.2 Explicit Measurement Initialization
  3.2 Running Applications
    3.2.1 Disable the collection of performance data for certain types of OpenMP constructs
    3.2.2 Selecting the Output Format
    3.2.3 Specifying the Name of the Report File
    3.2.4 Disable Output
    3.2.5 Using Hardware Counters with ompP
    3.2.6 Using evaluators with ompP
    3.2.7 Incremental Profiling
4 The Contents of ompP's Profiling Report
  4.1 General Information
  4.2 Region Overview
  4.3 Callgraph
  4.4 Flat Region Profiles
  4.5 Callgraph Region Profiles
  4.6 Overhead Analysis Report
  4.7 Performance Properties Report
5 Analyzing ompP's Profiling Reports
6 Known Issues
7 Future Work
8 Changelog
Towards a performance tool interface for OpenMP: An approach based on directive rewriting. In Proceedings of the Third Workshop on OpenMP (EWOMP '01), September 2001.
a percentage of 1.02 of lost execution time.

The "Overheads wrt. whole program" section lists the same overhead times, but here the percentages are computed with respect to the total execution time (i.e., duration x number of threads). Also, the regions are sorted by their overheads in this part. Hence this section allows the easy identification of those regions that cause most of the overheads. In the example, R00007 contributes the most overhead to this application's execution, with 0.22 seconds, or 0.34 percent of total execution time.

Overhead classification scheme of ompP:

  S: synchronization overhead
  I: imbalance overhead
  L: limited parallelism overhead
  M: thread management overhead

[Table: classification of the reported times of each OpenMP construct into these four overhead categories; the scheme is described in detail in [2].]
barrier regions the location is specified by only one line number. Locks are identified by their memory address.

4.3 Callgraph

Example:

  Inclusive (%)       Exclusive (%)
  16.46 (100.0%)      5.45 (33.08%)   310.wupwise (4 threads)
   0.33 (  2.00%)     0.33 ( 2.00%)    PARLOOP  R00010 rndcnf.F (48-52)
   0.65 (  3.95%)     0.65 ( 3.95%)    PARLOOP  R00012 uinith.F (77-114)
   0.12 (  0.74%)     0.12 ( 0.74%)    PARLOOP  R00011 rndphi.F (50-54)
   0.60 (  3.68%)     0.00 ( 0.01%)    PARALLEL R00016 zcopy.F (46-78)
   0.06 (  0.37%)     0.06 ( 0.37%)     LOOP    R00018 zcopy.F (72-76)
   0.54 (  3.30%)     0.54 ( 3.30%)     LOOP    R00017 zcopy.F (53-57)
   4.09 ( 24.87%)     0.00 ( 0.01%)    PARALLEL R00007 muldoe.F (63-145)
   1.83 ( 11.13%)     1.83 (11.13%)     LOOP    R00008 muldoe.F (68-102)

The callgraph view of ompP's profiling report shows the callgraph (or call-tree) of the execution of the application profiled by ompP. Note that the callgraph only shows the OpenMP regions visible to ompP; it does not contain normal user functions unless the user manually instruments those functions (see Sect. 3.1.1).

The right part of the callgraph shows a graphical representation of the child-parent relationships of the callgraph. Regions are identified by their region number and the source code location of the corresponding OpenMP construct. The root of the tree is implicitly the entire application. It is often advisable to manually instrument the main() function of an application (see Sect. 3.1.1). If this is done, the user region corresponding to main() w
e case.

UnparallelizedInMasterRegion, UnparallelizedInSingleRegion

Those two properties detect situations of serialized execution in master and single regions, respectively. In a single region only one thread executes code; the other threads have to wait at the exit barrier unless a nowait is specified. For master the situation is somewhat different: only the master thread executes the master region, but there is no synchronization point implied at the end of the construct. Hence it is not known whether the other threads are idle or performing useful work while the master thread executes inside the master construct. Therefore this property does not point out an actual inefficiency situation but merely points to a potential case for one.

5 Analyzing ompP's Profiling Reports

A set of utilities (perl scripts) will be included in a forthcoming release to allow for easier analysis and visualization of ompP's profiling reports.

6 Known Issues

The Pathscale C/C++ compiler is incompatible with expanding preprocessor macros in OpenMP pragmas:

  user@host $> kinst-ompp pathcc -mp -o test c_simple.c
  c_simple.c: In function 'main':
  c_simple.c:7: error: expected '#pragma omp' clause before 'POMP_DLIST_00001'

Workaround: invoke the kinst-ompp script as

  user@host $> kinst-ompp -nodecl -- pathcc -mp -o test c_simple.c

instead. -nodecl is an option to kinst-ompp to not use preprocessor definitions, and '--' signifies the end of options to kin
ed, or OMPP_OUTFORMAT is not specified at all, plaintext ASCII will be generated.

(Instead of specifying OMPP_DISABLE_LOCK you can also specify OMPP_DISABLE_LOCKS, the plural form.)

3.2.3 Specifying the Name of the Report File

OMPP_APPNAME can be used to specify the name of the target application. ompP uses the application name to infer the name of the report file (e.g., myapp.n-m.ompp.txt). Sometimes ompP fails to derive the correct name of the target application, e.g., when the application is not invoked directly but via another command like, for example, dplace. Use OMPP_APPNAME to specify the application's name in such cases.

OMPP_OUTDIR can be used to specify the directory in which to place the report files. OMPP_OVERWRITE can be specified to overwrite existing report files instead of incrementally numbering them (report.1.ompp.txt, report.2.ompp.txt, etc.). If OMPP_REPORTFILE is set, it will be used as the name of the report file.

3.2.4 Disable Output

To force ompP to give no warning, error, or status messages, set the environment variable OMPP_QUIET to any value not equal to 0. No message will then be given on stdout or stderr.

3.2.5 Using Hardware Counters with ompP

Hardware counters can be used with ompP by setting the environment variables OMPP_CTRn to the names of PAPI predefined or platform-specific event names. For example:

  $> export OMPP_CTR1=PAPI_L2_DCM

The number of hardware counters that can be recorded simultaneously b
ere n is the number of threads used and m is a consecutive number, starting with 0. Consecutive numbering is used because ompP does not by default overwrite existing profiling reports. See Sect. 4 for the contents of ompP's profiling report.

The following environment variables influence the execution of the monitored application.

3.2.1 Disable the collection of performance data for certain types of OpenMP constructs

By setting the environment variable OMPP_DISABLE_construct, the collection of performance data for this type of construct is disabled. Example:

  $> export OMPP_DISABLE_ATOMIC=1
or
  $> export OMPP_DISABLE_ATOMIC=yes

This will disable the collection of performance data for atomic constructs (both examples are for the bash shell). To re-enable data collection, either unset the environment variable or set it to 0. I.e.:

  $> unset OMPP_DISABLE_ATOMIC
or
  $> export OMPP_DISABLE_ATOMIC=0

Constructs which can be disabled are: ATOMIC, BARRIER, CRITICAL, FLUSH, LOOP, MASTER, SECTIONS, SINGLE, WORKSHARE, USER_REGION, and LOCK.

3.2.2 Selecting the Output Format

Use the environment variable OMPP_OUTFORMAT to select the format of ompP's report. Two formats are currently supported: plaintext ASCII (the default) and comma-separated values (CSV). To select CSV output, specify CSV or csv as the value of the OMPP_OUTFORMAT variable, e.g.:

  $> export OMPP_OUTFORMAT=CSV
  $> export OMPP_OUTFORMAT=csv

If a different string is specifi
ill appear as the one and only child of the application, and the other regions will appear below main().

The left part of the callgraph view shows inclusive and exclusive times spent in each of the regions. Inclusive means the sum of the current region and all of its children, while exclusive is only the time of the current region; this is often called "self" time. The times reported in this section are "sequential" in the sense that, for example for a critical section, the time listed is the summed execution time over all threads divided by the number of threads. This makes the execution times of the parallel execution comparable to the wallclock execution time of the total run. All percentages are likewise computed with respect to the wallclock duration of the entire run.

Note, finally, that the shown callgraph is the "union" of all callgraphs encountered by the OpenMP threads. That is, in principle OpenMP threads can execute independent constructs (of course synchronizing at implied synchronization points). The callgraph shown by ompP is the union of all individual callgraphs; each node of the callgraph was visited by at least one thread, possibly more.

4.4 Flat Region Profiles

  R00010 rndcnf.F (48-52) PARALLEL LOOP
   TID      execT   execC      bodyT   exitBarT   startupT   shutdwnT
     0       0.33       1       0.33       0.00       0.00       0.00
     1       0.33       1       0.32       0.01       0.00       0.00
     2       0.33       1       0.33       0.00       0.00       0.00
     3       0.33       1       0.32       0.01       0.00       0.00
   SUM       1.32       4       1.29       0.02       0.01       0.00

This section lists flat
is property checks for idle threads at explicit, programmer-added barriers.

ImbalanceInParallelRegion, ImbalanceInParallelLoop, ImbalanceInParallelWorkshare, ImbalanceInParallelSections

Those properties check for situations of imbalanced work in parallel regions and in worksharing regions.

ImbalanceDueToNotEnoughSections, ImbalanceDueToUnevenSectionDistribution

Those two properties try to uncover the reason for a situation of imbalanced execution in a sections construct more precisely. ImbalanceDueToNotEnoughSections is detected when the number of individual section constructs in an enclosing sections construct is not sufficient to provide all threads with work. Conversely, ImbalanceDueToUnevenSectionDistribution is detected if there are enough sections but they cannot be evenly distributed among all threads.

CriticalSectionContention, LockContention

Those properties check for contention for entering critical sections or for acquiring locks. They are based on the enterT of the corresponding profiling data structures.

FrequentAtomic

This property is detected if an atomic construct is executed with a frequency higher than a predefined threshold.

InsufficientWorkInParallelLoop

This property checks the amount of work (i.e., execution time) that is performed in a parallel region. Usually the amount of work should be big enough to amortize the cost of spawning threads. The InsufficientWorkInParallelLoop property warns if this is not th
[Table: timing and count categories reported for each OpenMP construct (MASTER, ATOMIC, BARRIER, FLUSH, USER_REGION, CRITICAL, LOCK, LOOP, WORKSHARE, SECTIONS, SINGLE, PARALLEL, PARALLEL_LOOP, PARALLEL_SECTIONS, PARALLEL_WORKSHARE), with columns for the main, enter, body, barrier, and exit sub-regions.]

This table shows which timing categories and counts are reported for the various OpenMP constructs. The 'C' indicates for which sub-region hardware counter data is reported, if any.

As a general rule, each region is thought of as being composed of smaller sub-regions, and the reported times and counts correspond to the composition as follows:

  main = enter + body + barrier + exit

I.e., main corresponds to the whole construct (e.g., zcopy.F (46-78)) and contains all other sub-regions. The other sub-regions are not nested and do not overlap; hence main = enter + body + bar
ncremental profiling reports always contain only the fully executed region instances.

(Footnote 3: http://www.gnu.org/software/libmatheval)

4 The Contents of ompP's Profiling Report

ompP's profiling report contains the following parts, which will be discussed in detail in this section:

- General Information (Header)
- Region Overview
- Callgraph
- Flat Region Profiles
- Callgraph Region Profiles
- Overhead Analysis Report
- Performance Properties Report

4.1 General Information

Example:

  ----     ompP General Information     ----
  Start Date      : Tue Apr 17 18:45:57 2007
  End Date        : Tue Apr 17 18:46:13 2007
  Duration        : 16.46 sec
  User Time       : 15.56 sec
  System Time     : 0.63 sec
  Max Threads     : 4
  ompP Version    : 0.6.0
  ompP Build Date : Apr 17 2007 18:36:24
  PAPI Support    : available
  Max Counters    : 4
  PAPI Active     : yes
  Used Counters   : 2
  OMPP_CTR1       : PAPI_L2_DCM
  OMPP_CTR2       : not set
  OMPP_CTR3       : PAPI_TOT_INS
  OMPP_CTR4       : not set
  Max Evaluators  : 4
  Used Evaluators : 0
  OMPP_EVAL1      : not set
  OMPP_EVAL2      : not set
  OMPP_EVAL3      : not set
  OMPP_EVAL4      : not set

This section of the profiling report lists general data about the profiling run. This includes the times of the start and end of the program run and its duration (in seconds wallclock time), the number of threads used for the execution, the date when the ompP library was built, and the ompP version used. The ompP version string is represented as three numbers (major,
on is largely similar to the flat region profiles. However, the data displayed represents the summed execution times and counts only of the current execution graph.

The path from the root of the callgraph to the current region is shown as the first lines for each region. The lines have the format [sxy], where xy denotes the level in the hierarchy, starting with 0 (the root), and the symbol s has the following meaning: '*' stands for the root of the callgraph, '+' denotes that this entry has children in the callgraph, while '=' denotes that this region has no child entries in the callgraph (it is a leaf of the callgraph).

The data entries displayed for callgraph region profiles are similar to the ones shown for flat profiles. However, for selected columns both inclusive and exclusive data entries are displayed. Inclusive data represents this region and all of its descendants, while exclusive data excludes any descendants. In the example shown above the data is displayed for a leaf node, and hence inclusive and exclusive times for bodyT are the same.

Hardware counter data is handled similarly to timing data, i.e., a /I or a /E is appended to the counter name, for example PAPI_L2_DCM/I and PAPI_L2_DCM/E.

4.6 Overhead Analysis Report

  Total runtime (wallclock)   : 16.38 sec [4 threads]
  Number of parallel regions  : 10
  Parallel coverage           : 10.93 sec (66.73%)

Parallel regions sorted by wallclock time:
            Type      Location             Wallclock (%)
rier + exit.

The enter subregion corresponds to entering or starting a construct. For critical sections this is the time required to enter the section, and for parallel regions and combined worksharing-parallel regions this is the time for thread startup.

Similarly, the exit subregion corresponds to leaving a construct: signalling the critical section or lock as available, or shutting down threads.

The barrier subregion is present for worksharing regions and represents the time waiting at the implicit exit barriers of those worksharing constructs (unless a nowait clause is specified).

The body is the part of the construct where the actual work gets done. Some constructs, such as USER_REGION, do not have enter, barrier, or exit sub-regions; in this case only main is reported.

Finally, the counts and times for each sub-region are reported under different names to reflect their meaning for the different OpenMP constructs. For example, the time in the enter subregion is reported as enterT for locks and critical sections but as startupT for parallel regions. Please refer to the table above for a detailed list of the reported times and counts for all OpenMP constructs.

Also note that the times and counts are always reported left to right, such that main is followed by body, followed by barrier, followed by enter and exit. execT and execC are reported for any construct, and the time reported by execT should always be the sum of all other times reported, for all threads.
ser-defined regions, both on a block level as well as whole functions, can be instrumented with Opari's pragma handling mechanism. The pragma #pragma pomp inst begin(name) marks the begin of the region and #pragma pomp inst end(name) marks its end. For example:

  int foo()
  {
  #pragma pomp inst begin(foo)
    /* ... */
  #pragma pomp inst end(foo)
    return 1;
  }

To mark an alternative exit point of a region (such as an additional return statement), use the #pragma pomp inst altend(name) pragma. For FORTRAN the syntax is:

  !POMP$ INST BEGIN(name)
  ...
  !POMP$ INST ALTEND(name)
  ...
  !POMP$ INST END(name)

3.1.2 Explicit Measurement Initialization

ompP will by default start monitoring when the first call to an OpenMP construct (or user-instrumented region) is made. To explicitly start monitoring, typically in main(), use the Opari #pragma pomp inst init pragma. This is especially useful for programs that do a significant amount of sequential work before entering the parallel regions. Placing #pragma pomp inst init at the beginning of main() will guarantee that some timing results, such as the parallel coverage (see Sect. 4.6), will be reported correctly by ompP in this case.

3.2 Running Applications

Invoke the instrumented OpenMP application like a regular OpenMP application. Make sure OMP_NUM_THREADS is set to the desired number of OpenMP threads. For an application called myapp, ompP will by default write the profiling report to myapp.n-m.ompp.txt, wh
st-ompp.

7 Future Work

- Support for additional output formats, e.g., XML. Write a converter of profiling reports into the CUBE format.
- Allow named evaluators, for example missrate=L2_MISSES/MEM_REFERENCES (the name of the evaluator would then be used as the column head).
- Instead of displaying inclusive and exclusive data in callgraph region profiles, display data differentiated into self + descendants (children). This might make manual reasoning easier. Example: bodyT/S and bodyT/C.

8 Changelog

ompP v0.6.0 (May 2007): Initial public release.

ompP v0.6.1 (June 2007): Some minor bug fixes and tweaks.

ompP v0.6.2 (July 2007):
- Fixed a bug in which FORTRAN workshare constructs were not properly maintained on the call stack (thanks to Alan Morris for reporting this).
- Fixed a bug in which loop constructs with a nowait clause would not show up with any monitoring data.

References

[1] Shirley Browne, Jack Dongarra, N. Garner, G. Ho, and Philip J. Mucci. A portable programming interface for performance evaluation on modern processors. Int. J. High Perform. Comput. Appl., 14(3):189-204, 2000.

[2] Karl Fürlinger and Michael Gerndt. Analyzing overheads and scalability characteristics of OpenMP applications. In Proceedings of the Seventh International Meeting on High Performance Computing for Computational Science (VECPAR '06), pages 39-51, Rio de Janeiro, Brazil, 2006. LNCS 4395.

[3] Bernd Mohr, Allen D. Malony, Sameer S. Shende, and Felix Wolf.
the region for which hardware counter data was acquired.

ompP uses libmatheval (see footnote 3) to evaluate the numeric values of the evaluator strings, and those values appear as additional columns in the profiling report, similar to plain hardware counters (see Sects. 4.4 and 4.5).

3.2.7 Incremental Profiling

Incremental profiling refers to the method of continuously capturing profiling reports while the program is running. While pure "one-shot" profiling does usually not allow one to uncover and explain the reason for and temporal relationship of performance phenomena, this is possible with full event tracing and, to some extent, also with incremental profiling. Hence, incremental profiling can be a good compromise between full tracing and pure one-shot profiling, as it usually is less intrusive and generates smaller amounts of performance data.

Incremental profiling is enabled in ompP by setting the environment variable OMPP_DUMP_INTERVAL to the desired duration (in seconds) between capture points. The duration must be at least 0.1 seconds; shorter intervals would likely cause too much overhead to produce usable results.

  $> export OMPP_DUMP_INTERVAL=1

The profiling reports are delivered in the same format as normal ompP profiling reports (see Sect. 4). A profiling dump at time x contains the performance data as it is available at time x. Since ompP's region statistics are always updated when regions are exited, the i
y ompP is a compile-time constant, set to 4 by default; see the definition of OMPP_PAPI_MAX_CTRS in file ompp.h if you want to increase this limit.

During startup ompP will display a message indicating whether registering the specified counter(s) was successful:

  ompP: successfully registered counter PAPI_L2_DCM

If the specified event name(s) are either not recognized or cannot be counted together, ompP will issue a warning:

  ompP: PAPI name-to-code error

for an unrecognized event name, or

  ompP: Error adding event to eventset

for conflicting events that cannot be counted together by the underlying hardware.

3.2.6 Using evaluators with ompP

Evaluators are a convenience feature of ompP to transform hardware performance counter data into more readable forms directly in the profiling tool. An evaluator is an arithmetic formula in string form that can involve hardware counter data. For example:

  $> OMPP_EVAL1=PAPI_FP_OPS/1000000
     (compute the megaflop rate)
  $> OMPP_EVAL1=1-L2_MISSES/L2_REFERENCES
     (L2 hit rate, Itanium)
  $> OMPP_EVAL1="1-((L3_MISSES-L3_WRITES_L2_WB_MISS)/(L3_REFERENCES-L3_WRITES_L2_WB_ALL))"
     (L3 hit rate, Itanium)

ompP will extract the hardware counter names from evaluator strings and program PAPI to collect the necessary data. In addition to PAPI event counter names, evaluators can contain numeric constants, and they can contain references to EXECT and EXECC. Those two special variables denote the execution time and counts for
    