        Curie advanced userguide - VI-HPS
Hardware topology

[Figure: hardware topology of a Curie fat node, as reported by hwloc: 128 GB of memory, 4 NUMA nodes of 32 GB, 4 sockets with 8 cores each and 24 MB of L3 cache per socket.]

The hardware topology is the organization of cores, processors, sockets and memory in a node. The image above was created with hwloc. You can access hwloc on Curie with the command module load hwloc (see the lstopo sketch below).

Definitions

We define here some vocabulary:

- Binding: a Linux process can be bound (or stuck) to one or many cores, which means that the process and its threads can run only on a given selection of cores. For example, a process bound to a socket of a Curie fat node can run on any of the 8 cores of that processor.
- Affinity: the policy of resource management (cores and memory) for processes.
- Distribution: the distribution of MPI processes describes how these processes are spread across cores, sockets or nodes.

On Curie, the default behaviour for distribution, affinity and binding is managed by SLURM, more precisely by the ccc_mprun command.

Process distribution

We present here some examples of MPI process distributions:

- block (or round): this is the standard distribution. From the SLURM manpage: the block distribution method will distribute tasks to a node such that consecutive tasks share a node. For example, consider an allocation of two nodes, each with 8 cores: a block distribution request will place tasks 0 to 7 on the first node and tasks 8 to 15 on the second node.
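Coming back to hwloc, here is a minimal sketch of how to inspect the topology of the node you are running on with the lstopo tool; the options shown are generic hwloc options, not Curie-specific settings:

bash-4.00$ module load hwloc
bash-4.00$ lstopo --no-io                 # summary of NUMA nodes, sockets, caches and cores (I/O devices hidden)
bash-4.00$ lstopo fat_node_topology.png   # save the same topology as an image for later reference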
Socket binding: the process and its threads can run on all cores of a socket.

#!/bin/bash
#MSUB -r MyJob_Para            # Request name
#MSUB -n 32                    # Number of tasks to use
#MSUB -x                       # Require an exclusive node
#MSUB -T 1800                  # Elapsed time limit in seconds
#MSUB -o example_%I.o          # Standard output. %I is the job id
#MSUB -A paxxxx                # Project ID
#MSUB -E '--cpu_bind=none'     # Disable default SLURM binding

mpirun --bind-to-socket -np 32 ./a.out

You can also specify the number of cores to assign to each MPI process:

#!/bin/bash
#MSUB -r MyJob_Para            # Request name
#MSUB -n 32                    # Number of tasks to use
#MSUB -x                       # Require an exclusive node
#MSUB -T 1800                  # Elapsed time limit in seconds
#MSUB -o example_%I.o          # Standard output. %I is the job id
#MSUB -A paxxxx                # Project ID
#MSUB -E '--cpu_bind=none'     # Disable default SLURM binding

mpirun --bind-to-socket --cpus-per-proc 4 -np 8 ./a.out

Here we assign 4 cores per MPI process.

Manual process management

BullxMPI gives you the possibility to place your processes manually through a hostfile and a rankfile. An example:

#!/bin/bash
#MSUB -r MyJob_Para            # Request name
#MSUB -n 32                    # Number of tasks to use
#MSUB -x                       # Require an exclusive node
#MSUB -T 1800                  # Elapsed time limit in seconds
#MSUB -o example_%I.o          # Standard output. %I is the job id
#MSUB -A paxxxx                # Project ID
#MSUB -E '--cpu_bind=none'     # Disable default SLURM binding

hostname > hostfile.txt
echo "rank 0=$HOSTNAME slot=0,1,2,3"     > rankfile.txt
echo "rank 1=$HOSTNAME slot=8,10,12,14"  >> rankfile.txt
echo "rank 2=$HOSTNAME slot=16,17,22,23" >> rankfile.txt
echo "rank 3=$HOSTNAME slot=19,20,21,31" >> rankfile.txt
mpirun --hostfile hostfile.txt --rankfile rankfile.txt -np 4 ./a.out
In the following paragraphs, we present the different possibilities of process distribution and binding. These options can be mixed when possible.

Remark: the following examples use a whole Curie fat node. We reserve 32 cores with #MSUB -n 32 and #MSUB -x in order to have all the cores available and to do what we want with them. These are only examples for simple cases; in other cases there may be conflicts with SLURM.

Process distribution

Block distribution by core:

#!/bin/bash
#MSUB -r MyJob_Para            # Request name
#MSUB -n 32                    # Number of tasks to use
#MSUB -x                       # Require an exclusive node
#MSUB -T 1800                  # Elapsed time limit in seconds
#MSUB -o example_%I.o          # Standard output. %I is the job id
#MSUB -A paxxxx                # Project ID
#MSUB -E '--cpu_bind=none'     # Disable default SLURM binding

mpirun --bycore -np 32 ./a.out

Cyclic distribution by socket:

#!/bin/bash
#MSUB -r MyJob_Para            # Request name
#MSUB -n 32                    # Number of tasks to use
#MSUB -x                       # Require an exclusive node
#MSUB -T 1800                  # Elapsed time limit in seconds
#MSUB -o example_%I.o          # Standard output. %I is the job id
#MSUB -A paxxxx                # Project ID
#MSUB -E '--cpu_bind=none'     # Disable default SLURM binding

mpirun --bysocket -np 32 ./a.out
If you want more details, you can contact us or visit the PAPI website.

VampirTrace / Vampir

VampirTrace is a library which lets you profile your parallel code by taking traces during the execution of the program. We present here an introduction to VampirTrace and Vampir.

Basics

First, you must compile your code with the VampirTrace compilers. In order to use VampirTrace, you need to load the vampirtrace module:

bash-4.00$ module load vampirtrace
bash-4.00$ vtcc -c prog.c
bash-4.00$ vtcc -o prog.exe prog.o

The available compilers are:

- vtcc: C compiler
- vtc++, vtCC and vtcxx: C++ compilers
- vtf77 and vtf90: Fortran compilers

To compile an MPI code, you should type:

bash-4.00$ vtcc -vt:cc mpicc -g -c prog.c
bash-4.00$ vtcc -vt:cc mpicc -g -o prog.exe prog.o

For the other languages you have:

- vtcc -vt:cc mpicc: MPI C compiler
- vtc++ -vt:cxx mpic++, vtCC -vt:cxx mpiCC and vtcxx -vt:cxx mpicxx: MPI C++ compilers
- vtf77 -vt:f77 mpif77 and vtf90 -vt:f90 mpif90: MPI Fortran compilers

By default, the VampirTrace wrappers use the Intel compilers. To switch to another compiler, you can use the same method as for MPI:

bash-4.00$ vtcc -vt:cc gcc -O2 -c prog.c
bash-4.00$ vtcc -vt:cc gcc -O2 -o prog.exe prog.o

To profile an OpenMP or a hybrid OpenMP/MPI application, you should add the corresponding OpenMP option of the compiler:

bash-4.00$ vtcc -openmp -O2 -c prog.c
bash-4.00$ vtcc -openmp -O2 -o prog.exe prog.o
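Putting these pieces together, a hybrid MPI/OpenMP Fortran code could be instrumented as follows. This is only a sketch: prog.f90 is a placeholder, and the -openmp flag assumes the Intel Fortran compiler behind the wrapper.

bash-4.00$ module load vampirtrace
bash-4.00$ vtf90 -vt:f90 mpif90 -openmp -O2 -g -c prog.f90          # instrument a hybrid MPI/OpenMP source
bash-4.00$ vtf90 -vt:f90 mpif90 -openmp -O2 -g -o prog.exe prog.o   # link with the same wrapper
bash-4.00$ ccc_mprun ./prog.exe                                     # run as usual; traces are written at the end of the run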
cpu_bind=MASK - curie1139, task  5  5 [18761]: mask 0x40404040 set
cpu_bind=MASK - curie1139, task  0  0 [18756]: mask 0x1010101 set
cpu_bind=MASK - curie1139, task  1  1 [18757]: mask 0x10101010 set
cpu_bind=MASK - curie1139, task  6  6 [18762]: mask 0x8080808 set
cpu_bind=MASK - curie1139, task  4  4 [18760]: mask 0x4040404 set
cpu_bind=MASK - curie1139, task  3  3 [18759]: mask 0x20202020 set
cpu_bind=MASK - curie1139, task  2  2 [18758]: mask 0x2020202 set
cpu_bind=MASK - curie1139, task  7  7 [18763]: mask 0x80808080 set

We can see here that the MPI rank 0 process is launched over the cores 0, 8, 16 and 24 of the node. These cores are all located on the node's first socket.

Remark: with the -c option, SLURM will try to gather the cores of each process as well as possible to get the best performance. In the previous example, all the cores of a given MPI process are located on the same socket.

Another example:

$ ccc_mprun -n 1 -c 32 -E '--cpu_bind=verbose' ./a.out
cpu_bind=MASK - curie1017, task  0  0 [34710]: mask 0xffffffff set

We can see that the process is not bound to a core and can run over all the cores of the node.

BullxMPI

BullxMPI has its own process management policy. To use it, you first have to disable SLURM's process management policy by adding the directive #MSUB -E '--cpu_bind=none'. You can then use the BullxMPI launcher mpirun:

#!/bin/bash
#MSUB -r MyJob_Para            # Request name
#MSUB -n 32                    # Number of tasks to use
#MSUB -x                       # Require an exclusive node
#MSUB -T 1800                  # Elapsed time limit in seconds
#MSUB -o example_%I.o          # Standard output. %I is the job id
#MSUB -A paxxxx                # Project ID
#MSUB -E '--cpu_bind=none'     # Disable default SLURM binding

mpirun -np 32 ./a.out
If you need more information, you can contact us.

Scalasca + Vampir

Scalasca can generate an OTF trace file in order to visualize it with Vampir. To activate traces, you can add the -t option to scalasca when you launch the run.
MPI

Embarrassingly parallel jobs and MPMD jobs

- An embarrassingly parallel job is a job which launches independent processes. These processes need few or no communications.
- An MPMD job is a parallel job which launches different executables over the processes. An MPMD job can be parallel with MPI and can involve many communications.

These two concepts are separate, but we present them together because the way to launch them on Curie is similar. A simple example was already given on the Curie info page.

In the following example we use ccc_mprun to launch the job (srun can be used too). We want to launch bin0 on MPI rank 0, bin1 on MPI rank 1 and bin2 on MPI rank 2. We first have to write a shell script which describes the topology of our job:

launch_exe.sh:

#!/bin/bash
if [ ${SLURM_PROCID} -eq 0 ]; then
  ./bin0
fi
if [ ${SLURM_PROCID} -eq 1 ]; then
  ./bin1
fi
if [ ${SLURM_PROCID} -eq 2 ]; then
  ./bin2
fi

We can then launch our job with 3 processes:

ccc_mprun -n 3 ./launch_exe.sh

The script launch_exe.sh must have execute permission. When ccc_mprun launches the job, it initializes some environment variables. Among them, SLURM_PROCID defines the current MPI rank.
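The same dispatch logic can be written more compactly with a case statement. This is a sketch under the same assumptions: bin0, bin1 and bin2 are your own binaries and SLURM_PROCID is set by ccc_mprun.

#!/bin/bash
# launch_exe.sh - run one binary per MPI rank (variant of the script above)
case ${SLURM_PROCID} in
  0) ./bin0 ;;
  1) ./bin1 ;;
  2) ./bin2 ;;
  *) echo "rank ${SLURM_PROCID}: nothing to do" ;;
esac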
Note: in this example, the BullxMPI process management policy can only act on the 32 cores allocated by SLURM.

The default BullxMPI process management policy is:

- the processes are not bound;
- the processes can run on all cores;
- the default distribution is block by core and by node.

The option --report-bindings gives you a report about the binding of the processes before the run:

#!/bin/bash
#MSUB -r MyJob_Para            # Request name
#MSUB -n 32                    # Number of tasks to use
#MSUB -x                       # Require an exclusive node
#MSUB -T 1800                  # Elapsed time limit in seconds
#MSUB -o example_%I.o          # Standard output. %I is the job id
#MSUB -A paxxxx                # Project ID
#MSUB -E '--cpu_bind=none'     # Disable default SLURM binding

mpirun --report-bindings --bind-to-socket --cpus-per-proc 4 -np 8 ./a.out

And here is the output:

$ mpirun --bind-to-socket --cpus-per-proc 4 -np 8 ./a.out
[curie1342:19946] [[40080,0],0] odls:default:fork binding child [[40080,1],3] to socket 1 cpus 22222222
[curie1342:19946] [[40080,0],0] odls:default:fork binding child [[40080,1],4] to socket 2 cpus 44444444
[curie1342:19946] [[40080,0],0] odls:default:fork binding child [[40080,1],5] to socket 2 cpus 44444444
[curie1342:19946] [[40080,0],0] odls:default:fork binding child [[40080,1],6] to socket 3 cpus 88888888
[curie1342:19946] [[40080,0],0] odls:default:fork binding child [[40080,1],7] to socket 3 cpus 88888888
[curie1342:19946] [[40080,0],0] odls:default:fork binding child [[40080,1],0] to socket 0 cpus 11111111
[curie1342:19946] [[40080,0],0] odls:default:fork binding child [[40080,1],1] to socket 0 cpus 11111111
[curie1342:19946] [[40080,0],0] odls:default:fork binding child [[40080,1],2] to socket 1 cpus 22222222
The corresponding environment variables have the form OMPI_MCA_xxxxx, where xxxxx is the name of the parameter:

#!/bin/bash
#MSUB -r MyJob_Para            # Request name
#MSUB -n 32                    # Number of tasks to use
#MSUB -T 1800                  # Elapsed time limit in seconds
#MSUB -o example_%I.o          # Standard output. %I is the job id
#MSUB -e example_%I.e          # Error output. %I is the job id
#MSUB -A paxxxx                # Project ID

set -x
cd ${BRIDGE_MSUB_PWD}
export OMPI_MCA_mpi_show_mca_params=all
ccc_mprun ./a.out

Optimizing with BullxMPI

You can try these parameters in order to optimize BullxMPI:

export OMPI_MCA_mpi_leave_pinned=1

This setting improves the communication bandwidth if the code reuses the same buffers for communications during the execution.

export OMPI_MCA_btl_openib_use_eager_rdma=1

This parameter optimizes the latency of short messages on the InfiniBand network, but the code will use more memory.

Be careful: these parameters are not set by default, and they can change the behaviour of your code.

Debugging with BullxMPI

Sometimes, BullxMPI codes can hang in a collective communication for large jobs. If you find yourself in this case, you can try this parameter:

export OMPI_MCA_coll=^ghc,tuned

This setting disables the optimized collective communications; it can slow down your code if it uses many collective operations.

Process distribution, affinity and binding

Introduction
Intel

- -opt_report: generates a report describing the optimisations performed, on stderr (-O3 required).
- -ip, -ipo: inter-procedural optimizations (single-file and multi-file). The command xiar must be used instead of ar to generate a static library from objects compiled with the -ipo option.
- -fast: default high optimisation level (-O3 -ipo -static). Careful: this option is not allowed with MPI, because the MPI context needs to call some libraries which exist only in dynamic form, which is incompatible with the -static option. You need to replace -fast by -O3 -ipo.
- -ftz: flushes denormalized numbers to zero at runtime.
- -fp-relaxed: relaxed optimisation of mathematical functions. Leads to a small loss of accuracy.
- -pad: enables changing the memory layout (padding) of variables (ifort only).

There are also options which allow the use of specific instruction sets of Intel processors in order to optimize the code. These options are compatible with most Intel processors; the compiler will generate these instructions if the processor allows it:

- -xSSE4.2: may generate Intel SSE4 Efficient Accelerated String and Text Processing instructions, Intel SSE4 Vectorizing Compiler and Media Accelerator instructions, and Intel SSSE3, SSE3, SSE2 and SSE instructions.
- -xSSE4.1: may generate Intel SSE4 Vectorizing Compiler and Media Accelerator instructions for Intel processors, and Intel SSSE3, SSE3, SSE2 and SSE instructions.
You can increase (or reduce) the size of this buffer, but your code will then also use more memory. To change the size, you have to set an environment variable:

export VT_BUFFER_SIZE=64M
ccc_mprun ./prog.exe

In this example, the buffer is set to 64 MB. We can also increase the maximum number of flushes:

export VT_MAX_FLUSHES=10
ccc_mprun ./prog.exe

If the value of VT_MAX_FLUSHES is 0, the number of flushes is unlimited.

By default, VampirTrace first stores the profiling information in a local directory (/tmp) of each process. These files can be very large and can fill up that directory, so you should redirect this local directory to another location:

export VT_PFORM_LDIR=$SCRATCHDIR

There are more VampirTrace variables which can be used; see the User Manual for more details.

Vampirserver

Traces generated by VampirTrace can be very large, and Vampir can be very slow when visualizing them. Vampir provides Vampirserver, a parallel program which uses CPU computing to accelerate the Vampir visualization. First, you have to submit a job which will launch Vampirserver on Curie nodes:

$ cat vampirserver.sh
#!/bin/bash
#MSUB -r vampirserver          # Request name
#MSUB -n 32                    # Number of tasks to use
#MSUB -T 1800                  # Elapsed time limit in seconds
#MSUB -o vampirserver_%I.o     # Standard output. %I is the job id
#MSUB -e vampirserver_%I.e     # Error output. %I is the job id

module load vampir
ccc_mprun vngd
This is due to the fact that all memory regions are not physically on the same bus.

[Figure: NUMA nodes of a Curie hybrid node.]

In this picture we can see that if a piece of data is in memory module 0, a process running on the second socket (like the 4th process) will take more time to access it. We can introduce the notion of local data versus remote data: in our example, for a process running on socket 0, a piece of data is local if it is in memory module 0 and remote if it is in memory module 1.

We can then deduce the reasons why tuning the process affinity is important:

- Data locality improves performance. If your code uses shared memory (like pthreads or OpenMP), the best choice is to group your threads on the same socket: the shared data should be local to the socket and, moreover, the data will potentially stay in the processor's cache.
- System processes can interrupt your process running on a core. If your process is not bound to a core or to a socket, it can be moved to another core or to another socket. In this case, all the data of this process have to be moved with it, which can take some time.
- MPI communications are faster between processes which are on the same socket. If you know that two processes communicate a lot, you can bind them to the same socket.
- On Curie hybrid nodes, the GPUs are connected to buses which are local to a socket. A process can take a longer time to access a GPU which is not connected to its socket.
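To make the first point concrete, a standard Linux way to pin both a process and its memory to one NUMA node is numactl. This is only an illustration of the locality idea: the availability of numactl on Curie compute nodes is an assumption, and on Curie the recommended way to control affinity remains the SLURM and BullxMPI options described in this guide.

# Inspect the NUMA layout (nodes, their cores and their memory sizes)
numactl --hardware
# Run a program with its CPUs and its memory allocations restricted to NUMA node 0
numactl --cpunodebind=0 --membind=0 ./a.out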
Then you can submit your job. Here is an example of a submission script:

#!/bin/bash
#MSUB -r MyJob_Para            # Request name
#MSUB -n 32                    # Number of tasks to use
#MSUB -T 1800                  # Elapsed time limit in seconds
#MSUB -o example_%I.o          # Standard output. %I is the job id
#MSUB -e example_%I.e          # Error output. %I is the job id

set -x
cd ${BRIDGE_MSUB_PWD}
ccc_mprun ./prog.exe

At the end of the execution, the program generates many profiling files:

bash-4.00$ ls
a.out  a.out.0.def.z  a.out.1.events.z  ...  a.out.otf

To visualize those files, you must load the vampir module:

bash-4.00$ module load vampir
bash-4.00$ vampir a.out.otf

[Figure: Vampir window showing the trace: per-process timelines, function summary and function legend.]

If you need more information, you can contact us.

Tips

VampirTrace allocates a buffer to store its profiling information. If the buffer is full, VampirTrace flushes it to disk. By default, the size of this buffer is 32 MB per process and the maximum number of flushes is 1.
To get the list of available hardware counters, you can type the papi_avail command.

The PAPI library can also retrieve the MFLOPS of a given region of your code:

program main
  implicit none
  include 'f90papi.h'

  integer, parameter :: size = 1000
  integer, parameter :: ntimes = 100
  double precision, dimension(size,size) :: A, B, C
  integer :: i, j, n
  ! PAPI variables
  integer :: retval
  real(kind=4) :: proc_time, mflops, real_time
  integer(kind=8) :: flpins

  ! Init PAPI
  retval = PAPI_VER_CURRENT
  call PAPIf_library_init(retval)
  if (retval .NE. PAPI_VER_CURRENT) then
    print *, 'PAPI library_init ', retval
  endif
  call PAPIf_query_event(PAPI_FP_INS, retval)

  ! Init matrix
  do i = 1, size
    do j = 1, size
      C(i,j) = real(i*j, 8)
      B(i,j) = i + 0.1*j
    end do
  end do

  ! Set up counter
  call PAPIf_flips(real_time, proc_time, flpins, mflops, retval)

  ! DAXPY
  do n = 1, ntimes
    do i = 1, size
      do j = 1, size
        A(i,j) = 2.0*B(i,j) + C(i,j)
      end do
    end do
  end do

  ! Collect the data into the variables passed in
  call PAPIf_flips(real_time, proc_time, flpins, mflops, retval)

  ! Print results
  print *, 'Real_time:    ', real_time
  print *, 'Proc_time:    ', proc_time
  print *, 'Total flpins: ', flpins
  print *, 'MFLOPS:       ', mflops

end program main

And the output:

bash-4.00$ module load papi/4.1.3
bash-4.00$ ifort -I${PAPI_INC_DIR} papi_flops.f90 ${PAPI_LIBS}
bash-4.00$ ./a.out
Real_time:    6.1250001E-02
Proc_time:    5.1447589E-02
Total flpins: 100056592
MFLOPS:       1944.826
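To check which preset counters are actually available on the machine you are running on, you can simply browse the papi_avail output (a sketch; the module version is the one used above):

bash-4.00$ module load papi/4.1.3
bash-4.00$ papi_avail | head -30          # list the preset events and whether they are available
bash-4.00$ papi_avail | grep PAPI_FP      # look only at the floating-point related events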
Contents

1 Curie's advanced usage manual
2 Optimization
  2.1 Compilation options
    2.1.1 Intel
      2.1.1.1 Intel Sandy Bridge processors
    2.1.2 GNU
3 Submission
  3.1 Choosing or excluding nodes
4 MPI
  4.1 Embarrassingly parallel jobs and MPMD jobs
  4.2 BullxMPI
    4.2.1 MPMD jobs
    4.2.2 Tuning BullxMPI
    4.2.3 Optimizing with BullxMPI
    4.2.4 Debugging with BullxMPI
5 Process distribution, affinity and binding
  5.1 Introduction
    5.1.1 Hardware topology
    5.1.2 Definitions
    5.1.3 Process distribution
    5.1.4 Why is affinity important for improving performance?
    5.1.5 CPU affinity mask
  5.2 SLURM
    5.2.1 Process distribution
      5.2.1.1 Curie hybrid node
    5.2.2 Process binding
  5.3 BullxMPI
    5.3.1 Process distribution
    5.3.2 Process binding
    5.3.3 Manual process management
6 Using GPU
  6.1 Two sequential GPU runs on a single hybrid node
7 Profiling
  7.1 PAPI
  7.2 VampirTrace / Vampir
    7.2.1 Basics
    7.2.2 Tips
    7.2.3 Vampirserver
    7.2.4 CUDA profiling
  7.3 Scalasca
    7.3.1 Standard utilization
    7.3.2 Scalasca + Vampir
    7.3.3 Scalasca + PAPI
  7.4 Paraver
    7.4.1 Trace generation
    7.4.2 Converting traces to Paraver format
    7.4.3 Launching Paraver

Curie's advanced usage manual

If you have suggestions or remarks, please contact us: hotline.tgcc@cea.fr

Optimization

Compilation options

Compilers provide many options to optimize a code. These options are described in the following sections.
- -xSSSE3: may generate Intel SSSE3, SSE3, SSE2 and SSE instructions for Intel processors.
- -xSSE3: may generate Intel SSE3, SSE2 and SSE instructions for Intel processors.
- -xSSE2: may generate Intel SSE2 and SSE instructions for Intel processors.
- -xHost: this option applies one of the previous options depending on the processor on which the compilation is performed. This option is recommended for optimizing your code.

None of these options is used by default. The SSE instructions use the vectorization capability of Intel processors.

Intel Sandy Bridge processors

Curie thin nodes use the latest Intel processors, based on the Sandy Bridge architecture. This architecture provides new vectorization instructions called AVX (Advanced Vector eXtensions). The option -xAVX allows the compiler to generate code specific to the Curie thin nodes.

Be careful: a code generated with the -xAVX option runs only on Intel Sandy Bridge processors. Otherwise, you will get this error message:

Fatal Error: This program was not built to run in your system.
Please verify that both the operating system and the processor support Intel(R) AVX.

The Curie login nodes are Curie large nodes with Nehalem-EX processors. AVX code can nevertheless be generated on these nodes through cross-compilation by adding the -xAVX option. On a Curie large node, the -xHost option will not generate AVX code. If you need to compile with -xHost, or if the installation requires some tests (like autotools' configure), you can submit a job which will compile directly on the Curie thin nodes.
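As an illustration, cross-compiling a code for the thin nodes from a login node could look like the following sketch; prog.f90, prog_avx.exe, libmylib.a and mylib.o are placeholders, and -O3 -ipo is just one possible optimisation level.

bash-4.00$ ifort -O3 -ipo -xAVX -c prog.f90               # generate AVX code for the Sandy Bridge thin nodes
bash-4.00$ ifort -O3 -ipo -xAVX -o prog_avx.exe prog.o    # link; run this binary on thin nodes only
bash-4.00$ xiar rcs libmylib.a mylib.o                    # use xiar instead of ar for static libraries built with -ipo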
To work correctly, the two binaries have to be sequential (they must not use MPI).

Then run your script, making sure to submit two MPI processes with 4 cores per process:

$ cat multi_jobs_gpu.sh
#!/bin/bash
#MSUB -r jobs_gpu                 # Request name
#MSUB -n 2                        # 2 tasks
#MSUB -N 1                        # 1 node
#MSUB -c 4                        # each task takes 4 cores
#MSUB -q hybrid
#MSUB -T 1800
#MSUB -o multi_jobs_gpu_%I.out
#MSUB -e multi_jobs_gpu_%I.out

set -x
cd ${BRIDGE_MSUB_PWD}
export OMP_NUM_THREADS=4

ccc_mprun -E '--wait=0' -n 2 -c 4 ./launch_exe.sh
# -E '--wait=0' tells SLURM not to kill the job if one of the two processes terminates before the other

So your first process will be located on the first CPU socket and the second process will be on the second CPU socket; each socket is linked to one GPU.

$ ccc_msub multi_jobs_gpu.sh

Profiling

PAPI

PAPI is an API which allows you to retrieve hardware counters from the CPU. Here is an example in Fortran which gets the number of floating point operations of a matrix DAXPY:

program main
  implicit none
  include 'f90papi.h'

  integer, parameter :: size = 1000
  integer, parameter :: ntimes = 10
  double precision, dimension(size,size) :: A, B, C
  integer :: i, j, n
  ! PAPI variables
  integer, parameter :: max_event = 1
  integer, dimension(max_event) :: event
  integer :: num_events, retval
  integer(kind=8), dimension(max_event) :: values

  ! Init PAPI
  call PAPIf_num_counters(num_events)
  print *, 'Number of hardware counters supported: ', num_events
  call PAPIf_query_event(PAPI_FP_INS, retval)
  if (retval .NE. PAPI_OK) then
    event(1) = PAPI_TOT_INS
  else
    ! Total floating point operations
    event(1) = PAPI_FP_INS
  endif

  ! Init matrix
  do i = 1, size
    do j = 1, size
      C(i,j) = real(i*j, 8)
      B(i,j) = i + 0.1*j
    end do
  end do

  ! Set up the counters
  num_events = 1
  call PAPIf_start_counters(event, num_events, retval)
  ! Clear the counter values
  call PAPIf_read_counters(values, num_events, retval)

  ! DAXPY
  do n = 1, ntimes
    do i = 1, size
      do j = 1, size
        A(i,j) = 2.0*B(i,j) + C(i,j)
      end do
    end do
  end do

  ! Stop the counters and put the results in the array values
  call PAPIf_stop_counters(values, num_events, retval)

  ! Print results
  if (event(1) .EQ. PAPI_TOT_INS) then
    print *, 'TOT Instructions: ', values(1)
  else
    print *, 'FP Instructions:  ', values(1)
  endif

end program main
Cyclic distribution by node:

#!/bin/bash
#MSUB -r MyJob_Para            # Request name
#MSUB -n 32                    # Number of tasks to use
#MSUB -N 16                    # Number of nodes to use
#MSUB -x                       # Require exclusive nodes
#MSUB -T 1800                  # Elapsed time limit in seconds
#MSUB -o example_%I.o          # Standard output. %I is the job id
#MSUB -A paxxxx                # Project ID
#MSUB -E '--cpu_bind=none'     # Disable default SLURM binding

mpirun --bynode -np 32 ./a.out

Process binding

No binding:

#!/bin/bash
#MSUB -r MyJob_Para            # Request name
#MSUB -n 32                    # Number of tasks to use
#MSUB -x                       # Require an exclusive node
#MSUB -T 1800                  # Elapsed time limit in seconds
#MSUB -o example_%I.o          # Standard output. %I is the job id
#MSUB -A paxxxx                # Project ID
#MSUB -E '--cpu_bind=none'     # Disable default SLURM binding

mpirun --bind-to-none -np 32 ./a.out

Core binding:

#!/bin/bash
#MSUB -r MyJob_Para            # Request name
#MSUB -n 32                    # Number of tasks to use
#MSUB -x                       # Require an exclusive node
#MSUB -T 1800                  # Elapsed time limit in seconds
#MSUB -o example_%I.o          # Standard output. %I is the job id
#MSUB -A paxxxx                # Project ID
#MSUB -E '--cpu_bind=none'     # Disable default SLURM binding

mpirun --bind-to-core -np 32 ./a.out
At the end of the execution, the program generates a directory which contains the profiling files:

bash-4.00$ ls
epik_...

To visualize those files, you can type:

bash-4.00$ scalasca -examine epik_...

[Figure: Scalasca analysis report displayed in the Cube browser (summary of a Helmholtz solver run): metric tree on the left, call tree / flat view in the middle, and per-process distribution on the right.]
By default, the distribution is block by core: the MPI rank 0 is then located on the first socket and the MPI rank 1 is on the first socket too. The majority of GPU codes assign GPU 0 to MPI rank 0 and GPU 1 to MPI rank 1; in this case, the bandwidth between MPI rank 1 and GPU 1 is not optimal.

If your code does this, in order to obtain the best performance you should:

- use the block-cyclic distribution;
- if you intend to use only 2 MPI processes per node, reserve 4 cores per process with the directive #MSUB -c 4. The two processes will then be placed on two different sockets.

Process binding

By default, processes are bound to a core. For multi-threaded jobs, the threads created by a process are bound to the core assigned to that process. To allow these threads to use other cores, SLURM provides the option -c to assign several cores to a process:

#!/bin/bash
#MSUB -r MyJob_Para            # Request name
#MSUB -n 8                     # Number of tasks to use
#MSUB -c 4                     # Assign 4 cores per process
#MSUB -T 1800                  # Elapsed time limit in seconds
#MSUB -o example_%I.o          # Standard output. %I is the job id
#MSUB -A paxxxx                # Project ID

export OMP_NUM_THREADS=4
ccc_mprun ./a.out

In this example, our hybrid OpenMP/MPI code runs on 8 MPI processes and each process uses 4 OpenMP threads. Here is an example of the output with the verbose option for binding:

$ ccc_mprun ./a.out
BullxMPI

MPMD jobs

BullxMPI (or OpenMPI) jobs can be launched with the mpirun launcher. In this case, we have other ways to launch MPMD jobs (see the embarrassingly parallel jobs section).

We take the same example as in the embarrassingly parallel jobs section. There are then two ways of launching MPMD scripts:

- We do not need the launch_exe.sh script anymore; we can launch the job directly with the mpirun command:

mpirun -np 1 ./bin0 : -np 1 ./bin1 : -np 1 ./bin2

- In launch_exe.sh, we can replace SLURM_PROCID by OMPI_COMM_WORLD_RANK:

launch_exe.sh:

#!/bin/bash
if [ ${OMPI_COMM_WORLD_RANK} -eq 0 ]; then
  ./bin0
fi
if [ ${OMPI_COMM_WORLD_RANK} -eq 1 ]; then
  ./bin1
fi
if [ ${OMPI_COMM_WORLD_RANK} -eq 2 ]; then
  ./bin2
fi

We can then launch our job with 3 processes:

mpirun -np 3 ./launch_exe.sh

Tuning BullxMPI

BullxMPI is based on OpenMPI. It can be tuned with parameters; the command ompi_info -a gives you the list of all parameters and their descriptions:

curie50$ ompi_info -a
  MCA mpi: parameter "mpi_show_mca_params" (current value: <none>, data source: default value)
           Whether to show all MCA parameter values during MPI_INIT or not (good for reproducibility
           of MPI jobs for debug purposes). Accepted values are all, default, file, api, and
           environment - or a comma delimited combination of them.

These parameters can be modified with environment variables set before the ccc_mprun command.
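For example, you could look up one parameter with ompi_info and then override it through its environment variable just before the run. This is a sketch; mpi_leave_pinned is simply one of the tuning parameters discussed in this guide.

bash-4.00$ ompi_info --param mpi all | grep leave_pinned      # inspect the current value and description
bash-4.00$ export OMPI_MCA_mpi_leave_pinned=1                 # override it through the environment
bash-4.00$ ccc_mprun ./a.out                                  # the new value applies to this run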
GNU

There are also options which allow the use of specific instruction sets of Intel processors in order to optimize the code behaviour. These options are compatible with most Intel processors; the compiler will use these instructions if the processor allows it:

- -mmmx / -mno-mmx: switch on or off the use of the MMX instruction set
- -msse / -mno-sse: idem for SSE
- -msse2 / -mno-sse2: idem
- -msse3 / -mno-sse3: idem
- -mssse3 / -mno-ssse3: idem
- -msse4.1 / -mno-sse4.1: idem
- -msse4.2 / -mno-sse4.2: idem
- -msse4 / -mno-sse4: idem
- -mavx / -mno-avx: idem (for the Curie thin nodes partition only)

Submission

Choosing or excluding nodes

SLURM provides the possibility to choose or to exclude specific nodes in the reservation for your job.

To choose nodes:

#!/bin/bash
#MSUB -r MyJob_Para                  # Request name
#MSUB -n 32                          # Number of tasks to use
#MSUB -T 1800                        # Elapsed time limit in seconds
#MSUB -o example_%I.o                # Standard output. %I is the job id
#MSUB -e example_%I.e                # Error output. %I is the job id
#MSUB -A paxxxx                      # Project ID
#MSUB -E '-w curie[1000-1003]'       # Include 4 nodes (curie1000 to curie1003)

set -x
cd ${BRIDGE_MSUB_PWD}
ccc_mprun ./a.out

To exclude nodes:

#!/bin/bash
#MSUB -r MyJob_Para                  # Request name
#MSUB -n 32                          # Number of tasks to use
#MSUB -T 1800                        # Elapsed time limit in seconds
#MSUB -o example_%I.o                # Standard output. %I is the job id
#MSUB -e example_%I.e                # Error output. %I is the job id
#MSUB -A paxxxx                      # Project ID
#MSUB -E '-x curie[1000-1003]'       # Exclude 4 nodes (curie1000 to curie1003)

set -x
cd ${BRIDGE_MSUB_PWD}
ccc_mprun ./a.out
[Figure: NUMA topology of a Curie hybrid node with its GPUs; each GPU is attached to the bus of one socket.]

For all these reasons, it is better to know the NUMA configuration of the Curie nodes (fat, hybrid and thin). In the following sections, we present some ways to tune the process affinity of your jobs.

CPU affinity mask

The affinity of a process is defined by a mask. A mask is a binary value whose length is the number of cores available on a node. For example, Curie hybrid nodes have 8 cores, so the binary mask has 8 digits. Each digit is 0 or 1: the process can run only on the cores whose digit is 1. A binary mask must be read from right to left.

For example, a process which runs on cores 0, 4, 6 and 7 has the affinity binary mask 11010001.

SLURM and BullxMPI use these masks, but converted to hexadecimal numbers.

- To convert a binary value to hexadecimal (note that obase must be set before ibase):

$ echo 'obase=16;ibase=2;11010001' | bc
D1

- To convert a hexadecimal value to binary:

$ echo 'obase=2;ibase=16;D1' | bc
11010001

The numbering of the cores is the PU numbering in the output of hwloc.

SLURM

SLURM is the default launcher for jobs on Curie. SLURM manages the processes, even for sequential jobs, and we recommend that you use ccc_mprun. By default, SLURM binds each process to a core, and the distribution is block by node and by core.

The option -E '--cpu_bind=verbose' of ccc_mprun gives you a report about the binding of the processes before the run.
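When reading these '--cpu_bind=verbose' reports, it can be handy to wrap the two conversions in a tiny helper. This is a sketch built on the bc commands above; mask2bin and bin2mask are hypothetical helper names.

# Convert a hexadecimal affinity mask (as printed by SLURM) to binary, and back
mask2bin () { echo "obase=2;ibase=16;$(echo "$1" | tr 'a-f' 'A-F')" | bc; }
bin2mask () { echo "obase=16;ibase=2;$1" | bc; }

mask2bin 1010101     # -> 1000000010000000100000001  (cores 0, 8, 16 and 24)
bin2mask 11010001    # -> D1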
[Figure: block distribution by core.]

- cyclic by socket: from the SLURM manpage, the cyclic distribution method will distribute tasks to a socket such that consecutive tasks are distributed over consecutive sockets (in a round-robin fashion). For example, consider an allocation of two nodes, each with 2 sockets of 4 cores: a cyclic distribution by socket will place tasks 0, 2, 4, 6 on the first socket and tasks 1, 3, 5, 7 on the second socket. In the following image, the distribution is cyclic by socket and block by node.

[Figure: cyclic distribution by socket.]

- cyclic by node: from the SLURM manpage, the cyclic distribution method will distribute tasks to a node such that consecutive tasks are distributed over consecutive nodes (in a round-robin fashion). For example, consider an allocation of two nodes, each with 2 sockets of 4 cores: a cyclic distribution by node will place tasks 0, 2, 4, 6, 8, 10, 12, 14 on the first node and tasks 1, 3, 5, 7, 9, 11, 13, 15 on the second node. In the following image, the distribution is cyclic by node and block by socket.

[Figure: cyclic distribution by node, block by socket.]

Why is affinity important for improving performance?

Curie nodes are NUMA (Non Uniform Memory Access) nodes. This means that it takes longer to access some regions of memory than others.
The syntax is:

export EPK_METRICS=PAPI_FP_OPS:PAPI_TOT_CYC

Paraver

Paraver is a flexible performance visualization and analysis tool that can be used to analyze MPI, OpenMP, MPI+OpenMP, hardware counter profiles, operating system activity and many other things you may think of.

In order to use the Paraver tools, you need to load the paraver module:

bash-4.00$ module load paraver
bash-4.00$ module show paraver
/usr/local/ccc_users_env/modules/development/paraver/4.1.1:

module-whatis    Paraver
conflict         paraver
prepend-path     PATH /usr/local/paraver-4.1.1/bin
prepend-path     PATH /usr/local/extrae-2.1.1/bin
prepend-path     LD_LIBRARY_PATH /usr/local/paraver-4.1.1/lib
prepend-path     LD_LIBRARY_PATH /usr/local/extrae-2.1.1/lib
module           load papi
setenv           PARAVER_HOME /usr/local/paraver-4.1.1
setenv           EXTRAE_HOME /usr/local/extrae-2.1.1
setenv           EXTRAE_LIB_DIR /usr/local/extrae-2.1.1/lib
setenv           MPI_TRACE_LIBS /usr/local/extrae-2.1.1/lib/libmpitrace.so

Trace generation

The simplest way to activate MPI instrumentation of your code is to dynamically load the library before the execution. This can be done by adding the following line to your submission script:

export LD_PRELOAD=$LD_PRELOAD:$MPI_TRACE_LIBS

The instrumentation process is managed by Extrae and also needs a configuration file in XML format.
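For reference, a complete submission script for this dynamic instrumentation could look like the sketch below; extrae_config_file.xml is a placeholder name for your Extrae configuration file (described just after), and the rest reuses the usual Curie submission template.

#!/bin/bash
#MSUB -r MyJob_Trace             # Request name
#MSUB -n 32                      # Number of tasks to use
#MSUB -T 1800                    # Elapsed time limit in seconds
#MSUB -o example_%I.o            # Standard output. %I is the job id
#MSUB -A paxxxx                  # Project ID

set -x
cd ${BRIDGE_MSUB_PWD}
module load paraver                                   # defines MPI_TRACE_LIBS, EXTRAE_HOME, ...
export LD_PRELOAD=$LD_PRELOAD:$MPI_TRACE_LIBS         # instrument MPI calls at run time
export EXTRAE_CONFIG_FILE=extrae_config_file.xml      # Extrae configuration (placeholder name)
ccc_mprun ./a.out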
$ ccc_msub vampirserver.sh

When the job is running, you will obtain this output:

$ ccc_mpp
USER  ACCOUNT  BATCHID  NCPU  QUEUE  PRIORITY  STATE  RLIM   RUN/START  SUSP  OLD  NAME          NODES
toto  genXXX   234481   32    large  210332    RUN    30.0m  13m        -     13m  vampirserver  curie1352

$ ccc_mpeek 234481
Found license file: /usr/local/vampir-7.3/bin/lic.dat
Running 31 analysis processes... (abort with Ctrl-C or vngd-shutdown)
Server listens on: curie1352:30000

In our example, the Vampirserver master node is curie1352 and the port to connect to is 30000. You can then launch Vampir on a front node. Instead of clicking on "Open", click on "Remote Open".

[Figure: connecting to Vampirserver from the Vampir GUI: fill in the server (curie1352) and the port (30000).]

Fill in the server and the port, and you will be connected to Vampirserver. You can then open an OTF file and visualize it.

Notes:

- You can ask for any number of processors you want; the analysis will be faster if your profiling files are big. But be careful: it consumes your computing time.
- Do not forget to delete the Vampirserver job after your analysis.

CUDA profiling

VampirTrace can collect profiling data from CUDA programs. As previously, you have to replace the compilers by the VampirTrace wrappers: the NVCC compiler should be replaced by vtnvcc. Then, when you run your program, you have to set an environment variable:

export VT_CUDARTTRACE=yes
ccc_mprun ./prog.exe
Here is the previous script, modified:

#!/bin/bash
#MSUB -r MyJob_Para              # Request name
#MSUB -n 32                      # Number of tasks to use
#MSUB -T 1800                    # Elapsed time limit in seconds
#MSUB -o example_%I.o            # Standard output. %I is the job id
#MSUB -e example_%I.e            # Error output. %I is the job id

set -x
cd ${BRIDGE_MSUB_PWD}
scalasca -analyze -t mpirun ./prog.exe

At the end of the execution, the program generates a directory which contains the profiling files:

bash-4.00$ ls
epik_...

You can visualize them as previously. To generate the OTF trace files, you can type:

bash-4.00$ ls
epik_...
bash-4.00$ elg2otf epik_...

It will generate an OTF file under the epik_... directory. To visualize it, you can load Vampir:

bash-4.00$ module load vampir
bash-4.00$ vampir epik_.../a.otf

Scalasca + PAPI

Scalasca can retrieve hardware counters with PAPI. For example, if you want to retrieve the number of floating point operations:

#!/bin/bash
#MSUB -r MyJob_Para              # Request name
#MSUB -n 32                      # Number of tasks to use
#MSUB -T 1800                    # Elapsed time limit in seconds
#MSUB -o example_%I.o            # Standard output. %I is the job id
#MSUB -e example_%I.e            # Error output. %I is the job id

set -x
cd ${BRIDGE_MSUB_PWD}
export EPK_METRICS=PAPI_FP_OPS
scalasca -analyze mpirun ./prog.exe

The number of floating point operations will then appear in the profile when you visualize it. You can retrieve at most 3 hardware counters at the same time on Curie.
In this example, there are several steps:

- You have to create a hostfile (here hostfile.txt) where you put the hostnames of all the nodes your run will use.
- You have to create a rankfile (here rankfile.txt) where you assign to each MPI rank the cores on which it can run. In our example, the process of rank 0 has the cores 0, 1, 2 and 3 as its affinity, and so on. Be careful: the numbering of the cores is different from the hwloc output; on a Curie fat node, the first eight cores are on the first socket (socket 0), etc.
- You can then launch mpirun, specifying the hostfile and the rankfile.

Using GPU

Two sequential GPU runs on a single hybrid node

To launch two separate sequential GPU runs on a single hybrid node, you have to set the environment variable CUDA_VISIBLE_DEVICES, which selects the GPUs each process is allowed to see. First, create a script to launch the binaries:

$ cat launch_exe.sh
#!/bin/bash
set -x
export CUDA_VISIBLE_DEVICES=${SLURM_PROCID}
# the first process will see only the first GPU and the second process will see only the second GPU
if [ ${SLURM_PROCID} -eq 0 ]; then
  ./bin_1 > job_${SLURM_PROCID}.out
fi
if [ ${SLURM_PROCID} -eq 1 ]; then
  ./bin_2 > job_${SLURM_PROCID}.out
fi
You will have to add the following line to your submission script:

export EXTRAE_CONFIG_FILE=extrae_config_file.xml

All the details about how to write a configuration file are available in Extrae's manual, which you can find at $EXTRAE_HOME/doc/user-guide.pdf. You will also find many example scripts in the $EXTRAE_HOME/examples/LINUX file tree.

You can also add some manual instrumentation in your code to add specific user events. This is mandatory if you want to see your own functions in the Paraver timelines.

If the trace generation succeeds during the computation, you will find a directory set-0 containing some .mpit files in your working directory, as well as a TRACE.mpits file which lists all these files.

Converting traces to Paraver format

Extrae provides a tool named mpi2prv to convert the .mpit files into a .prv file which can be read by Paraver. Since this can be a long operation, we recommend that you use the parallel version of this tool, mpimpi2prv. You will need fewer processes than previously used to compute. An example script is provided below:

bash-4.00$ cat rebuild.sh
#MSUB -r merge
#MSUB -n 8
#MSUB -T 1800

set -x
cd ${BRIDGE_MSUB_PWD}
ccc_mprun mpimpi2prv -syn -e path_to_your_binary -f TRACE.mpits -o file_to_be_analysed.prv

Launching Paraver

You now just have to launch:

paraver file_to_be_analysed.prv

As Paraver may require a lot of memory and CPU, it may be better to launch it through a submission script (do not forget then to activate the -X option of ccc_msub).
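Such a submission script could look like the following sketch; paraver.sh, the single-task geometry and the time limit are assumptions, and the -X flag is the ccc_msub option mentioned above for enabling X11 forwarding.

$ cat paraver.sh
#!/bin/bash
#MSUB -r paraver_gui             # Request name
#MSUB -n 1                       # a single task is enough for the GUI
#MSUB -T 3600                    # Elapsed time limit in seconds
#MSUB -o paraver_%I.o            # Standard output. %I is the job id
#MSUB -A paxxxx                  # Project ID

set -x
cd ${BRIDGE_MSUB_PWD}
module load paraver
paraver file_to_be_analysed.prv

$ ccc_msub -X paraver.sh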
For example, on a hybrid node:

$ ccc_mprun -E '--cpu_bind=verbose' -q hybrid -n 8 ./a.out
cpu_bind=MASK - curie7054, task  3  3 [3534]: mask 0x8 set
cpu_bind=MASK - curie7054, task  0  0 [3531]: mask 0x1 set
cpu_bind=MASK - curie7054, task  1  1 [3532]: mask 0x2 set
cpu_bind=MASK - curie7054, task  2  2 [3533]: mask 0x4 set
cpu_bind=MASK - curie7054, task  4  4 [3535]: mask 0x10 set
cpu_bind=MASK - curie7054, task  5  5 [3536]: mask 0x20 set
cpu_bind=MASK - curie7054, task  7  7 [3538]: mask 0x80 set
cpu_bind=MASK - curie7054, task  6  6 [3537]: mask 0x40 set

In this example, we can see that process 5 has 20 as its hexadecimal mask, i.e. 00100000 in binary: the 5th process will run only on core 5.

Process distribution

To change the default distribution of processes, you can use the option -E '-m ...' of ccc_mprun. With SLURM, there are two levels of process distribution: node and socket.

- Node block distribution:

ccc_mprun -E '-m block' ./a.out

- Node cyclic distribution:

ccc_mprun -E '-m cyclic' ./a.out

By default, the distribution over the sockets is block. In the following examples of socket distribution, the node distribution is block.

- Socket block distribution:

ccc_mprun -E '-m block:block' ./a.out

- Socket cyclic distribution:

ccc_mprun -E '-m block:cyclic' ./a.out

Curie hybrid node

On a Curie hybrid node, each GPU is connected to one socket (see the previous picture). It will take longer for a process to access a GPU if this process is not on the same socket as the GPU.
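On such a node, one way to keep each of two MPI ranks next to its own GPU is the cyclic socket distribution described above. This is a sketch: passing both srun options in a single -E string is an assumption, so check the verbose report in your job output.

# Spread two MPI ranks over the two sockets of a hybrid node (socket-cyclic distribution),
# so that rank 0 and rank 1 each sit next to their own GPU
ccc_mprun -E '-m block:cyclic --cpu_bind=verbose' -n 2 ./a.out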
Scalasca

Scalasca is a set of software which lets you profile your parallel code by taking traces during the execution of the program. It is a kind of parallel gprof with more information. We present here an introduction to Scalasca.

Standard utilization

First, you must compile your code by adding the Scalasca tool in front of your call to the compiler. In order to use Scalasca, you need to load the scalasca module:

bash-4.00$ module load scalasca
bash-4.00$ scalasca -instrument mpicc -c prog.c
bash-4.00$ scalasca -instrument mpicc -o prog.exe prog.o

or, for Fortran:

bash-4.00$ module load scalasca
bash-4.00$ scalasca -instrument mpif90 -c prog.f90
bash-4.00$ scalasca -instrument mpif90 -o prog.exe prog.o

You can compile OpenMP programs:

bash-4.00$ scalasca -instrument ifort -openmp -c prog.f90
bash-4.00$ scalasca -instrument ifort -openmp -o prog.exe prog.o

And you can profile hybrid programs:

bash-4.00$ scalasca -instrument mpif90 -openmp -O3 -c prog.f90
bash-4.00$ scalasca -instrument mpif90 -openmp -O3 -o prog.exe prog.o

Then you can submit your job. Here is an example of a submission script:

#!/bin/bash
#MSUB -r MyJob_Para              # Request name
#MSUB -n 32                      # Number of tasks to use
#MSUB -T 1800                    # Elapsed time limit in seconds
#MSUB -o example_%I.o            # Standard output. %I is the job id
#MSUB -e example_%I.e            # Error output. %I is the job id

set -x
cd ${BRIDGE_MSUB_PWD}
export SCAN_MPI_LAUNCHER=ccc_mprun
scalasca -analyze ccc_mprun ./prog.exe
To compile, you have to load the PAPI module:

bash-4.00$ module load papi/4.1.3
bash-4.00$ ifort -I${PAPI_INC_DIR} papi.f90 ${PAPI_LIBS}
bash-4.00$ ./a.out
Number of hardware counters supported:
FP Instructions: 10046163
For analyzing your data, you will need some of the configuration files available in Paraver's browser under the $PARAVER_HOME/cfgs directory.

[Figure: Paraver window, with the window browser (MPI call activity, MPI call duration, instructions per cycle, ...), the trace timeline and the window properties panel.]
    