With the objective of tuning the behaviour of the SMPSs runtime, a configuration file where some variables are set is introduced. However, we do not recommend changing these variables unless the user considers it necessary to improve the performance of his or her applications. The current set of variables is the following (values in parentheses denote the default):

task_graph.task_count_high_mark (1000): defines the maximum number of non-executed tasks that the graph will hold. The purpose of this variable is to control the memory usage.

task_graph.task_count_low_mark (900): whenever the task graph reaches task_graph.task_count_high_mark tasks, task graph generation is suspended until the number of non-executed tasks goes below task_graph.task_count_low_mark.

These variables are set in a plain text file with the following syntax:

    task_graph.task_count_high_mark = 2000
    task_graph.task_count_low_mark = 1500

The file where the variables are set is indicated by setting the CSS_CONFIG_FILE environment variable. For example, if the file file.cfg contains the above variable settings, the following command can be used:

    > export CSS_CONFIG_FILE=file.cfg

9 References

[1] SMP Superscalar website: http://www.bsc.es/plantillaG.php?cat_id=385
[2] Josep M. Perez, Rosa M. Badia and Jesus Labarta. "A flexible and portable programming model for SMP and multi-cores". Technical report 03/2007, Barcelona Supercomputing Center, Centro Nacional de Supercomputación, June 2007.
The runtime will find the data dependencies and will determine, based on them, which functions can run in parallel with others and which cannot. Therefore, SMPSs provides programmers with a more flexible programming model, with an adaptive parallelism level depending on the application input data.

2 Installation

SMP Superscalar is distributed in source code form and must be compiled and installed before using it.

2.1 Compilation requirements

The SMP Superscalar compilation process requires the following system components:

  - GCC 4.0 or later
  - GNU make
  - Optional: automake, autoconf, libtool, bison, flex

2.2 Compilation

To compile and install SMP Superscalar, please follow these steps:

1. Decompress the source tarball:
     $ tar -xvzf smpss-1.0.tar.gz
2. Enter the source directory:
     $ cd smpss-1.0
3. Run the configure script, specifying the installation directory as the prefix argument. More information can be obtained by running ./configure --help:
     $ ./configure --prefix=/opt/SMPSS
4. Run make:
     $ make
5. Run make install:
     $ make install

2.3 User environment

If SMP Superscalar has not been installed into a system directory, then the user must set the following environment variables:

1. The PATH environment variable must contain the bin subdirectory of the installation:
     export PATH=$PATH:/opt/SMPSS/bin
2. The LD_LIBRARY_PATH environment variable must contain the lib subdirectory
void factorial(unsigned int n, unsigned int *result)
{
    *result = 1;
    for (; n > 1; n--)
        *result = *result * n;
}

The next example has two input vectors, left of size leftSize and right of size rightSize, and a single output, result, of size leftSize+rightSize:

#pragma css task input(left[leftSize], right[rightSize]) output(result[leftSize+rightSize])
void merge(float *left, unsigned int leftSize, float *right, unsigned int rightSize, float *result);

The next example shows another feature. In this case, with the keyword highpriority, the user is giving hints to the scheduler: the lu0 tasks will be, when data dependencies allow it, executed before the ones that are not marked as high priority.

#pragma css task highpriority inout(diag)
void lu0(float diag[64][64]);

Waiting on data

Notation:

#pragma css wait on(<list of expressions>)

On clause: Comma-separated list of expressions corresponding to the addresses that the system will wait for.

In Example 1 the vector data is generated by bubblesort. The wait pragma waits for this function to finish before printing the result.

Example 1:

#pragma css task inout(data[size]) input(size)
void bubblesort(float *data, unsigned int size);

void main()
{
    ...
    bubblesort(data, size);
    #pragma css wait on(data)
    for (unsigned int i = 0; i < size; i++) {
        printf("%f ", data[i]);
    }
}

In Example 2, matrix[N][N] is a 2-dimensional array of pointers to 2-dimensional blocks of floats.
The main thread's purpose is to populate the graph in order to feed tasks to the worker threads. Nevertheless, it may stop generating new tasks under several conditions: too many tasks in the graph, a wait on, a barrier, or the end of the program. In those situations it takes the same role as the worker threads, consuming tasks until the blocking condition is no longer valid.

8 Advanced features

8.1 Using Paraver

To understand the behavior and performance of their applications, users can generate Paraver tracefiles of their SMP Superscalar applications.

If the -t tracing flag is enabled at compilation time, the application will generate a Paraver tracefile of the execution. The default name for the tracefile is gss-trace-<id>.prv. The name can be changed by setting the environment variable CSS_TRACE_FILENAME. For example, if it is set as follows:

    > export CSS_TRACE_FILENAME=tracefile

then after the execution the files tracefile.0001.row, tracefile.0001.prv and tracefile.0001.pcf are generated. All these files are required by the Paraver tool.

The traces generated by SMP Superscalar can be visualized and analyzed with Paraver. Paraver is distributed independently of SMP Superscalar and can be obtained from:

    http://www.cepba.upc.es/paraver

Several configuration files to visualise and analyse SMP Superscalar tracefiles are provided in the SMP Superscalar distribution, in the directory <install_dir>/share/cellss/paraver_cfgs.

The following table summarizes what is shown by each configuration file.
SMP Superscalar (SMPSs) User's Manual

Version 1.0
Barcelona Supercomputing Center
July 2007

Barcelona Supercomputing Center
Centro Nacional de Supercomputación

Table of Contents

1 Introduction
2 Installation
  2.1 Compilation requirements
  2.2 Compilation
  2.3 User environment
  2.4 Compiler environment
3 Programming with SMPSs
  3.1 Task selection
  3.2 SMPSs syntax
4 Compiling
  4.1 Usage
  4.2 Examples
5 Setting the environment and executing
  5.1 Setting the number of CPUs and executing
6 Programming examples
  6.1 Matrix multiply
7 Internals of SMP Superscalar
8 Advanced features
  8.1 Using Paraver
  8.2 Configuration file
9 References

1 Introduction

This document is the user manual of the SMP Superscalar framework (SMPSs), which is based on a source-to-source compiler and a runtime library. The supported programming model allows programmers to write sequential applications, and the framework is able to exploit the existing concurrency and to use the different cores of a multi-core or SMP processor by means of automatic parallelization at execution time.
[3] Paraver website: www.cepba.upc.edu/paraver
The list of supported options is the following:

    > mcc --help

    -Dmacro          Option passed to the native preprocessor
    -c               Option passed to the native compiler
    -g               Option passed to the native compiler
    -h, --help       Prints this information
    -I<dir>          Option passed to the native preprocessor
    -k, --keep       Keep temporary files
    -L<dir>          Option passed to the native linker
    -l<lib>          Option passed to the native linker
    --no-includes    Don't try to regenerate include directives
    -O, -O1          Option passed to the native compiler
    -O2              Option passed to the native compiler
    -O3              Option passed to the native compiler
    -o <file>        Sets the name of the output file
    -t, --tracing    Enable program tracing
    -v, --verbose    Enables some informational messages
    -Wc,OPTIONS      Comma-separated list of options passed to the native compiler
    -Wl,OPTIONS      Comma-separated list of options passed to the native linker
    -Wp,OPTIONS      Comma-separated list of options passed to the native preprocessor

4.2 Examples

    > mcc -O3 matmul.c -o matmul

Compilation of the application file matmul.c with the -O3 optimization level. If there are no compilation errors, the executable file matmul is created, which can be called from the command line:

    > ./matmul

    > mcc --keep cholesky.c -o cholesky

Compilation of the cholesky.c application with the --keep option. Option --keep will not delete the intermediate files (files generated by the preprocessor
More examples are provided in that directory as well.

6.1 Matrix multiply

This example presents SMPSs code for a block matrix multiply. The block size is 64 x 64 floats.

#pragma css task input(A, B) inout(C)
static void block_addmultiply(float C[BS][BS], float A[BS][BS], float B[BS][BS])
{
    int i, j, k;

    for (i = 0; i < BS; i++)
        for (j = 0; j < BS; j++)
            for (k = 0; k < BS; k++)
                C[i][j] += A[i][k] * B[k][j];
}

int main(int argc, char **argv)
{
    int i, j, k;

    initialize(argc, argv, A, B, C);
    #pragma css start
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            for (k = 0; k < N; k++)
                block_addmultiply(C[i][j], A[i][k], B[k][j]);
    #pragma css finish
}

This main code will run in the main thread, while the block_addmultiply calls will be executed in all the threads. It is important to note that the sequential code, including the annotations, can be compiled with the native compiler, obtaining a sequential binary. This is very useful for debugging the algorithms.

7 Internals of SMP Superscalar

(Figure 1, below, shows the runtime structure: a main thread and worker threads, one per CPU, each linked with the SMPSs runtime library, which performs scheduling and executes the original task code.)
execution_phases.cfg — Profile of the percentage of time spent by each thread (master, worker) at each of the major phases in the runtime library, i.e. generating tasks, scheduling, task execution.

flushing.cfg — Intervals (dark blue) where each thread is flushing its local trace buffer to disk. The effect of the flushing overhead on the main thread is of significance, since it prevents the main thread from adding newer tasks to the graph. This could lead to starvation of the worker threads that would not happen when running without tracing.

task.cfg — Outlined function being executed by each thread.

task_number.cfg — Number (in order of task generation) of the task being executed by each thread. Light green for the initial tasks in program order, blue for the last tasks in program order. Intermixed green and blue indicate out-of-order execution.

task_profile.cfg — Time (in microseconds) each thread spent executing the different tasks. Change the statistic to: #burst (number of tasks of each type by thread); Average burst time (average duration of each task type).

3dh_duration_tasks.cfg — Histogram of the duration of tasks. One plane per task (Fixed Value Selector). Left column: 0 microseconds; right column: 3000 ms. Darker at a given duration means a higher number of instances of that duration.

8.2 Configuration file
The requirements we place on the programmer are that the application is composed of coarse-grained functions (for example, by applying blocking) and that these functions do not have collateral effects (only local variables and parameters are accessed). These functions are identified by annotations (somewhat similar to the OpenMP ones), and the runtime will try to parallelize the execution of the annotated functions (also called tasks).

The source-to-source compiler separates the annotated functions from the main code, and the library calls the annotated code. However, an annotation before a function does not indicate that it is a parallel region (as it does in OpenMP). The annotation just indicates the direction of the parameters: input, output or inout. To be able to exploit the parallelism, the SMPSs runtime takes this information about the parameters and builds a data dependency graph, where each node represents an instance of an annotated function and edges between nodes denote data dependencies. From this graph, the runtime is able to schedule independent nodes for execution on different cores at the same time. Techniques imported from the computer architecture area, such as data dependency analysis, data renaming and data locality exploitation, are applied to increase the performance of the application.

While OpenMP explicitly specifies what is parallel and what is not, with SMPSs what is specified are functions whose invocations could be run in parallel, depending on the data dependencies.
object files). If there are no compilation errors, the executable file cholesky is created.

    > mcc -O2 -t matmul.c -o matmul

Compilation with the -t (tracing) feature. When executing matmul, a tracefile of the execution of the application will be generated.

    > mcc -O2 -Wc,-funroll-loops,-ftree-vectorize,-ftree-vectorizer-verbose=3 matmul.c -o matmul

The list of flags after -Wc, is passed to the native compiler (for example, gcc). These options perform automatic vectorization of the code. Note: vectorization seems not to work properly on gcc with -O3.

5 Setting the environment and executing

5.1 Setting the number of CPUs and executing

Before executing an SMP Superscalar application, the number of processors to be used in the execution has to be defined. The default value is 2, but it can be set to a different number with the CSS_NUM_CPUS environment variable, for example:

    > export CSS_NUM_CPUS=8

SMP Superscalar applications are started from the command line in the same way as any other application. For example, for the compilation examples of section 4.2, the applications can be started as follows:

    > ./matmul <pars>
    > ./cholesky <pars>

6 Programming examples

This section presents a programming example for block matrix multiplication. The code is not complete, but the complete and working code can be found in the directory <install_dir>/share/docs/cellss/examples, with <install_dir> being the installation directory.
    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/SMPSS/lib

2.4 Compiler environment

The SMP Superscalar compiler requires the following programs to be available on the PATH environment variable:

  - GNU indent

3 Programming with SMPSs

SMP Superscalar applications are based on the parallelization at task level of sequential applications. The tasks (functions or subroutines) selected by the programmer will be executed on the different cores. Furthermore, the runtime detects when tasks are data-independent of each other and is able to schedule the simultaneous execution of several of them on different cores. All the above-mentioned actions (data dependency analysis and scheduling) are performed transparently to the programmer. However, to benefit from this automation, the computations to be executed on the cores should be of a certain granularity (about 80 microseconds or more). A limitation on the tasks is that they should only access their parameters and local variables.

3.1 Task selection

In the current version of SMP Superscalar it is the responsibility of the application programmer to select tasks of a certain granularity. For example, blocking is a technique that can be applied to increase the granularity of the tasks in applications that operate on matrices. Below is a sample code for a block matrix multiplication:

void block_addmultiply(double C[BS][BS], double A[BS][BS], double B[BS][BS])
{
    int i, j, k;
Figure 1: SMP Superscalar runtime behavior (per-thread ready task queues, a global ready task queue with high and low priority, a renaming table, and work stealing between threads)

When compiling an SMPSs application with mcc, the resulting object files are linked with the SMPSs runtime library. Then, when the application is started, the SMPSs runtime is automatically invoked. The runtime is decoupled into two parts: one runs the main user code and the other runs the tasks.

The most important change to the original user code is that the SMPSs compiler replaces each call to an annotated function with a call to the css_addTask function. At runtime, these calls to the css_addTask function will be responsible for the intended behavior of the application. At each call to css_addTask, the main thread will do the following actions:

  - A node that represents the called task is added to the task graph.
  - Data dependency analysis of the new task with the other previously called tasks.
  - Parameter renaming: similarly to register renaming, a technique from the superscalar processor area, we rename the output and inout parameters. For every function call that has a parameter that will be written, instead of writing to the original parameter location, a new memory location will be used; that is, a new instance of that parameter
    for (i = 0; i < BS; i++)
        for (j = 0; j < BS; j++)
            for (k = 0; k < BS; k++)
                C[i][j] += A[i][k] * B[k][j];
}

3.2 SMPSs syntax

Starting and finishing SMPSs applications

The following optional pragmas indicate the scope of the program that will use the SMPSs features:

#pragma css start
#pragma css finish

When the start pragma is reached, all the threads are initiated and run until the finish pragma is reached or the program finishes. Annotated functions must be called between these two pragmas when they are present. If they are not present in the user code, the compiler will automatically insert the start pragma at the beginning of the application and the finish pragma at the end.

Specifying a task

Notation:

#pragma css task [input(<input parameters>)]opt [inout(<inout parameters>)]opt [output(<output parameters>)]opt [highpriority]opt
{function declaration | function definition}

Input clause: List of parameters whose input value will be read.
Inout clause: List of parameters that will be read and written by the task.
Output clause: List of parameters that will be written to.
Highpriority clause: Specifies that the task will be sent for execution earlier than tasks without the highpriority clause.

Parameter notation: <parameter>[[<dimension>]]opt

Examples:

In this example, the factorial task has a single input parameter n and a single output parameter result:

#pragma css task input(n) output(result)
Each of these 2-dimensional arrays of floats is generated in the application by annotated functions. The pragma waits on the address of each of these blocks before printing the result to a file.

Example 2:

void write_matrix(FILE *file, matrix_t matrix)
{
    int rows, columns;
    int i, j, ii, jj;

    fprintf(file, "%d\n%d\n", N * BSIZE, N * BSIZE);
    for (i = 0; i < N; i++)
        for (ii = 0; ii < BSIZE; ii++) {
            for (j = 0; j < N; j++) {
                #pragma css wait on(matrix[i][j])
                for (jj = 0; jj < BSIZE; jj++)
                    fprintf(file, "%f ", matrix[i][j][ii][jj]);
            }
            fprintf(file, "\n");
        }
}

4 Compiling

All the steps of the SMPSs compiler have been integrated into a single-step compilation, invoked through mcc and the corresponding compilation options, which are indicated in the usage section below. The mcc compilation process consists of preprocessing the SMPSs pragmas, compiling for the native architecture with the corresponding compiler, and linking with the needed libraries (including the SMPSs libraries).

The current version is only able to compile single-source-code applications. A way of overcoming this limitation is to provide, through libraries, the code that does not contain annotations and does not call annotated functions.

4.1 Usage

The mcc compiler has been designed to mimic the options and behaviour of common C compilers. We also provide a means to pass non-standard parameters to the platform compiler and linker.
will be created, and it will replace the original one, becoming a renaming of the original parameter location. This allows that function call to execute independently of any previous function call that would write or read that parameter. This technique effectively removes some data dependencies by using additional storage, thus improving the chances of extracting more parallelism.

Every thread has its own ready task queue, including the main thread. There is also a global queue with priority. Whenever a task that has no predecessors is added to the graph, it is also added to the global ready task queue.

The worker threads consume ready tasks from the queues in the following order of preference:

1. High priority tasks from the global queue.
2. Tasks from their own queue, in LIFO order.
3. Tasks from any other thread's queue, in FIFO order.

Whenever a thread finishes executing a task, it checks which tasks have become ready and adds them to its own queue. This allows the thread to continue exploring the same area of the task graph unless there is a high priority task or that area has become empty.

In order to preserve temporal locality, threads consume tasks from their own queue in LIFO order, which allows them to reuse output parameters to a certain degree. The task stealing policy tries to minimise adverse effects on the cache by stealing in FIFO order; that is, it tries to steal the coldest tasks of the stolen thread.
    