
CHAMELEON User's Guide: Index


Contents

1. …often referred to as the DAG (Directed Acyclic Graph), and the flexibility of exploring the DAG at runtime. Thus, to a large extent, dynamic scheduling is synonymous with runtime scheduling. An important concept here is that of the critical path, which defines the upper bound on the achievable parallelism and needs to be pursued at maximum speed. This is in direct opposition to the fork-and-join or data-parallel programming models, where artificial synchronization points expose serial sections of the code in which multiple cores are idle while sequential processing takes place. The use of dynamic scheduling introduces a trade-off, though: the more dynamic (flexible) the scheduling is, the more centralized (and less scalable) the scheduling mechanism is. For that reason, PLASMA currently uses two scheduling mechanisms: one that is fully dynamic, and one where work is assigned statically and dependency checks are done at runtime. The first scheduling mechanism relies on unfolding a sliding window of the task graph at runtime and scheduling work by resolving data hazards: Read After Write (RAW), Write After Read (WAR) and Write After Write (WAW), a technique analogous to instruction scheduling in superscalar processors. It also relies on work stealing for balancing the load among all cores. The second scheduling mechanism relies on statically designating a path through the execution space of the al…
2. …page 4). 2. Use the existing function MORSE_Desc_Create_User: it is more flexible than MORSE_Desc_Create because you can give your own way to access tile data, so that your tiles can be allocated wherever you want in memory (see the next paragraph, Section 4.3.1.4 Step3). 3. Create your own function to fill the descriptor. If you understand the meaning of each item of MORSE_desc_t, you should be able to fill the structure correctly (good luck). In Step2 we use the first way to create the descriptor:

    MORSE_Desc_Create(&descA, NULL, MorseRealDouble,
                      NB, NB, NB*NB, N, N, 0, 0, N, N, 1, 1);

- desc is the descriptor to create.
- The second argument is a pointer to existing data. The existing data must follow the LAPACK/PLASMA matrix layout (Section 1.2.2.2 Tile Data Layout), i.e. a 1-D array in column-major order, if MORSE_Desc_Create is used to create the descriptor. The MORSE_Desc_Create_User function can be used if your data is organized differently; this is discussed in the next paragraph (Section 4.3.1.4 Step3). Giving a NULL pointer means you let the function allocate the memory space. This requires copying your data into the memory allocated by Desc_Create, which can be done with:

    MORSE_Lapack_to_Tile(A, N, descA);

- The third argument of Desc_Create is the datatype used for memory allocation.
- The fourth to sixth arguments stand respectively for the number of rows (NB) and columns (NB) in each tile, the total number o…
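For illustration, a minimal sketch of the descriptor-creation path described in this excerpt, for an existing column-major matrix. The header name morse.h and the MORSE_SUCCESS return code are assumptions; the two calls themselves follow the guide.

    #include <stdlib.h>
    #include <morse.h>   /* assumed public CHAMELEON/MORSE header */

    /* Create a tile descriptor for an N x N matrix stored in LAPACK
       column-major layout, then copy the data into the tile layout. */
    int build_descriptor(double *A, int N, int NB, MORSE_desc_t **descA)
    {
        /* NULL: let CHAMELEON allocate the tile storage itself */
        int rc = MORSE_Desc_Create(descA, NULL, MorseRealDouble,
                                   NB, NB, NB*NB,  /* tile rows, cols, elements */
                                   N, N,           /* size of the whole matrix  */
                                   0, 0, N, N,     /* submatrix: all of it      */
                                   1, 1);          /* 1x1 process grid          */
        if (rc != MORSE_SUCCESS)   /* MORSE_SUCCESS assumed as success code */
            return rc;
        /* copy the LAPACK-layout data into the freshly allocated tiles */
        return MORSE_Lapack_to_Tile(A, N, *descA);
    }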
3. Caution about the compatibility: CHAMELEON has been mainly tested with StarPU 1.1 releases.

2.1.2.8 hwloc
hwloc (Portable Hardware Locality) is a software package for accessing the topology of a multicore system, including components like cores, sockets, caches and NUMA nodes. It makes it possible to improve performance and to perform some topology-aware scheduling. hwloc is available in major distributions and for most OSes and can be downloaded from http://www.open-mpi.org/software/hwloc. Caution about the compatibility: hwloc should be compatible with the version of StarPU used.

2.1.2.9 pthread
The POSIX threads library is required to run CHAMELEON on Unix-like systems. It is a standard component of any such system.

2.1.3 Optional dependencies
2.1.3.1 OpenMPI
OpenMPI is an open-source Message Passing Interface implementation for execution on multiple nodes in a distributed-memory environment. MPI can be enabled only if the runtime system chosen is StarPU (default). To use MPI through StarPU it is necessary to compile StarPU with MPI enabled. Caution about the compatibility: CHAMELEON has been mainly tested with OpenMPI releases from versions 1.4 to 1.6.

2.1.3.2 Nvidia CUDA Toolkit
The Nvidia CUDA Toolkit provides a comprehensive development environment for C and C++ developers building GPU-accelerated applications. CHAMELEON can use a set of low-level optimized kernels coming from cuBLAS to accelerate co…
4. MORSE_Complex64_t alpha MORSE Complex64 t x int LDA MORSE Complex64 t xB int LDB MORSE Complex64 t beta MORSE Complex64 t C int LDC int MORSE zsyrk MORSE enum uplo MORSE enum trans int N int K MORSE_Complex64_t alpha MORSE Complex64 t A int LDA MORSE Complex64 t beta MORSE Complex64 t xC int LDC int MORSE zsyr2k MORSE enum uplo MORSE enum trans int N int K MORSE Complex64 t alpha MORSE Complex64 t A int LDA MORSE Complex64 t xB int LDB MORSE Complex64 t beta MORSE Complex64 t C int LDC int MORSE ztrmm MORSE enum side MORSE enum uplo MORSE enum transA MORSE enum diag int N int NRHS MORSE_Complex64_t alpha MORSE Complex64 t A int LDA MORSE Complex64 t B int LDB int MORSE ztrsm MORSE enum side MORSE enum uplo MORSE enum transA MORSE enum diag int N int NRHS MORSE_Complex64_t alpha MORSE Complex64 t A int LDA MORSE Complex64 t B int LDB int MORSE ztrsmpl int N int NRHS MORSE Complex64 t A int LDA MORSE desc t xdescL int IPIV MORSE Complex64 t xB int LDB Chapter 4 Using CHAMELEON 35 int MORSE ztrsmrv MORSE enum side MORSE enum uplo MORSE enum transA MORSE enum diag int N int NRHS MORSE Complex64 t alpha MORSE Complex64 t A int LDA MORSE Complex64 t B int LDB int MORSE ztrtri MORSE enum uplo MORSE enum diag int N MORSE Complex64 t A int LDA int MORSE zunglq int M int N int K MORSE Complex64 t A int LDA MORSE
5. …_Tile_Async(MORSE_desc_t *A, MORSE_desc_t *T, MORSE_desc_t *B,
                 MORSE_sequence_t *sequence, MORSE_request_t *request);
    int MORSE_zunmlq_Tile_Async(MORSE_enum side, MORSE_enum trans,
                 MORSE_desc_t *A, MORSE_desc_t *T, MORSE_desc_t *B,
                 MORSE_sequence_t *sequence, MORSE_request_t *request);
    int MORSE_zunmqr_Tile_Async(MORSE_enum side, MORSE_enum trans,
                 MORSE_desc_t *A, MORSE_desc_t *T, MORSE_desc_t *B,
                 MORSE_sequence_t *sequence, MORSE_request_t *request);
6. …approaches need to be found. We will specifically design sparse hybrid direct/iterative methods, which represent a promising approach.

1.1.3 Research papers
Research papers about MORSE can be found at http://icl.cs.utk.edu/projectsdev/morse/pubs/index.html

1.2 CHAMELEON
1.2.1 CHAMELEON software
The main purpose is to address the performance shortcomings of the LAPACK and ScaLAPACK libraries on multicore processors and multi-socket systems of multicore processors, and their inability to efficiently utilize accelerators such as Graphics Processing Units (GPUs). CHAMELEON is a framework written in C which provides routines to solve dense general systems of linear equations, symmetric positive definite systems of linear equations and linear least squares problems, using LU, Cholesky, QR and LQ factorizations. Real arithmetic and complex arithmetic are supported in both single precision and double precision. It supports Linux and Mac OS X machines (only tested on Intel x86-64 architecture). CHAMELEON is based on the PLASMA source code but is not limited to shared-memory environments and can exploit multiple GPUs. CHAMELEON is interfaced in a generic way with both the QUARK and StarPU runtime systems. This feature allows analyzing, in a unified framework, how sequential task-based algorithms behave with different runtime system implementations. Using CHAMELEON with the StarPU runtime system allows exploiting GPU…
7. …4.3.1.6 Step5
4.3.1.7 Step6
4.3.2 List of available routines
4.3.2.1 Auxiliary routines
4.3.2.2 Descriptor routines
4.3.2.3 Options routines
4.3.2.4 Sequences routines
4.3.2.5 Linear Algebra routines

1 Introduction to CHAMELEON
1.1 MORSE project
1.1.1 MORSE Objectives
When processor clock speeds flatlined in 2004, after more than fifteen years of exponential increases, the era of near-automatic performance improvements that the HPC application community had previously enjoyed came to an abrupt end. To develop software that will perform well on petascale and exascale systems with thousands of nodes and millions of cores, the list of major challenges that must now be confronted is formidable: (1) dramatic escalation in the costs of intrasystem communication between processors and/or levels of the memory hierarchy; (2) increased heterogeneity of the processing units (mixing CPUs, GPUs, etc. in varying and unexpected design combinations); (3) high levels of parallelism and more complex constraints mean that cooperating processes must be dynamically and unpredictably scheduled for asynchronous execution; (4) software will not run at scale without much better resilience to faults and far more robustness; and (5) new levels of self-adaptivity will be required to enable soft…
8. …multiply
- SYRK: symmetric matrix-matrix rank-k update
- SYR2K: symmetric matrix-matrix rank-2k update
- PEMV: matrix-vector multiply with a pentadiagonal matrix
- TRMM: triangular matrix-matrix multiply
- TRSM: triangular solve, multiple right-hand sides
- POSV: solve linear systems with a symmetric positive definite matrix
- GESV_INCPIV: solve linear systems with a general matrix
- GELS: linear least squares with a general matrix
- timing contains timing drivers to assess the performance of CHAMELEON routines. There are two sets of executables: those that do not use the tile interface and those that do (with "tile" in the name of the executable). Executables without the tile interface allocate data following LAPACK conventions, and these data can be given as arguments to CHAMELEON routines as you would do with LAPACK. Executables with the tile interface generate the data directly in the tile format CHAMELEON algorithms use to submit tasks to the runtime system. Executables with the tile interface should be faster because no data copy from the LAPACK matrix layout to the tile matrix layout is necessary. Calling example:

    ./timing/time_dpotrf --n_range=1000:10000:1000 --nb=320 \
                         --threads=9 --gpus=3 --nowarmup

List of the main options that can be used in timing:
- --help: show usage
- --threads: number of CPU workers (default: _SC_NPROCESSORS_ONLN)
- --gpus: number of GPU workers (default: 0)
- --n_range=R: range of N values, with R=Start:Stop:Step (defau…
9. …runtime synchronization barriers. Keep in mind that when the tile interface is called, e.g. MORSE_dpotrf_Tile, a synchronization function waiting for the actual execution and termination of all tasks is called to ensure the proper completion of the algorithm (i.e. data are up to date). The code shows how to exploit the async interface to pipeline subsequent algorithms so that fewer synchronizations are done. The code becomes:

    /* Morse structure containing parameters and a structure to
       interact with the Runtime system */
    MORSE_context_t *morse;
    /* MORSE sequence uniquely identifies a set of asynchronous
       function calls sharing common exception handling */
    MORSE_sequence_t *sequence = NULL;
    /* MORSE request uniquely identifies each asynchronous function call */
    MORSE_request_t request = MORSE_REQUEST_INITIALIZER;
    int status;

    morse_sequence_create(morse, &sequence);

    /* Factorization */
    MORSE_dpotrf_Tile_Async(UPLO, descA, sequence, &request);
    /* Solve */
    MORSE_dpotrs_Tile_Async(UPLO, descA, descX, sequence, &request);

    /* Synchronization barrier: the runtime ensures that all submitted
       tasks have been terminated */
    RUNTIME_barrier(morse);
    /* Ensure that all data processed on the GPUs we are depending on
       are back in main memory */
    RUNTIME_desc_getoncpu(descA);
    RUNTIME_desc_getoncpu(descX);

    status = sequence->status;

Here the sequence of dpotrf and dpotrs algorithms is processed without synchro…
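An alternative sketch of the same pipeline built only on the public sequence routines listed in Section 4.3.2.4, rather than on the internal RUNTIME calls above. The MORSE_Sequence_Create prototype and the MorseUpper value are assumptions consistent with the API listings; this is not the exact code of the guide.

    int solve_async(MORSE_desc_t *descA, MORSE_desc_t *descX)
    {
        MORSE_sequence_t *sequence = NULL;
        MORSE_request_t   request  = MORSE_REQUEST_INITIALIZER;
        int status;

        MORSE_Sequence_Create(&sequence);      /* assumed public creation call */
        MORSE_dpotrf_Tile_Async(MorseUpper, descA, sequence, &request);
        MORSE_dpotrs_Tile_Async(MorseUpper, descA, descX, sequence, &request);
        MORSE_Sequence_Wait(sequence);         /* wait for all submitted tasks */
        status = sequence->status;             /* collective error status      */
        MORSE_Sequence_Destroy(sequence);
        return status;
    }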
10. …used without libtmg. Caution about the compatibility: CHAMELEON has been mainly tested with the reference TMG from NETLIB and the Intel MKL 11.1 from Intel distribution 2013 sp1.

2.1.2.6 QUARK
QUARK (QUeuing And Runtime for Kernels) provides a library that enables the dynamic execution of tasks with data dependencies in a multi-core, multi-socket, shared-memory environment. One of the QUARK or StarPU runtime systems has to be enabled in order to schedule tasks on the architecture. If QUARK is enabled then StarPU is disabled, and conversely. Note: StarPU is enabled by default. When CHAMELEON is linked with QUARK it is not possible to exploit either CUDA (for GPUs) or MPI (distributed-memory environment); you can use StarPU to do so. Caution about the compatibility: CHAMELEON has been mainly tested with the QUARK library from PLASMA releases between versions 2.5.0 and 2.6.0.

2.1.2.7 StarPU
StarPU is a task programming library for hybrid architectures. StarPU handles runtime concerns such as: task dependencies; optimized heterogeneous scheduling; optimized data transfers and replication between main memory and discrete memories; optimized cluster communications. StarPU can be used to benefit from GPUs and distributed-memory environments. One of the QUARK or StarPU runtime systems has to be enabled in order to schedule tasks on the architecture. If StarPU is enabled then QUARK is disabled, and conversely. Note: StarPU is enabled by default.
11. …export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:path/to/libs:path/to/chameleon/lib

4.2.3 Build a Fortran program with CHAMELEON
CHAMELEON provides a Fortran interface to user functions. Example:

    call morse_version(major, minor, patch)   ! or
    call MORSE_VERSION(major, minor, patch)

Build and link are very similar to the C case. Compilation example:

    gfortran -o main.o -c main.f90

Static linking example:

    gfortran main.o -o main                                       \
      /home/yourname/install/chameleon/lib/libchameleon.a         \
      /home/yourname/install/chameleon/lib/libchameleon_starpu.a  \
      /home/yourname/install/chameleon/lib/libcoreblas.a          \
      -lstarpu-1.1 -Wl,--no-as-needed -lmkl_intel_lp64            \
      -lmkl_sequential -lmkl_core -lpthread -lm -lrt

Dynamic linking example:

    gfortran main.o -o main                            \
      -L/home/yourname/install/chameleon/lib           \
      -lchameleon -lchameleon_starpu -lcoreblas        \
      -lstarpu-1.1 -Wl,--no-as-needed -lmkl_intel_lp64 \
      -lmkl_sequential -lmkl_core -lpthread -lm -lrt

4.3 CHAMELEON API
CHAMELEON provides routines to solve dense general systems of linear equations, symmetric positive definite systems of linear equations and linear least squares problems, using LU, Cholesky, QR and LQ factorizations. Real arithmetic and complex arithmetic are supported in both single precision and double precision. Routines that compute linear algebra are of the following form: MORSE_name[_Tile[_Async]]. All user routines are prefixed with MORSE_; name follows the BLAS/LAPACK naming scheme for algori…
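As an illustration of this naming scheme, the Cholesky factorization exists in the three variants sketched below. The prototypes are reconstructed from the z-precision listings later in this chapter (the double-precision d variants follow by the precision-generation rules); treat them as a sketch rather than authoritative declarations.

    /* LAPACK layout: data given as a plain column-major array */
    int MORSE_dpotrf(MORSE_enum uplo, int N, double *A, int LDA);

    /* Tile layout: data wrapped in a MORSE descriptor, synchronous call */
    int MORSE_dpotrf_Tile(MORSE_enum uplo, MORSE_desc_t *A);

    /* Tile layout, asynchronous: tasks are submitted and the call returns;
       completion is tracked through the sequence/request pair */
    int MORSE_dpotrf_Tile_Async(MORSE_enum uplo, MORSE_desc_t *A,
                                MORSE_sequence_t *sequence,
                                MORSE_request_t *request);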
12. …BLAS, LAPACK, CBLAS, LAPACKE, pthread, m (math) and rt. These libraries will depend on the configuration of your CHAMELEON build. You can find these dependencies in the .pc files we generate during compilation, which are installed in the sub-directory lib/pkgconfig of your CHAMELEON install directory. Note also that you may need to specify where to find these libraries with the -L option of your compiler/linker. Before running your program, make sure that all the shared library paths your executable depends on are known: enter ldd main to check. If some shared library paths are missing, append them to the LD_LIBRARY_PATH (for Linux systems) environment variable (DYLD_LIBRARY_PATH on Mac, LIB on Windows).

4.2.2 Dynamic linking in C
For dynamic linking (you need to build CHAMELEON with the CMake option BUILD_SHARED_LIBS=ON) the procedure is similar to static compilation/link, but instead of specifying the paths to your static libraries you indicate the path to the dynamic libraries with the -L option and give the names of the libraries with the -l option, like this:

    gcc main.o -o main                                 \
      -L/home/yourname/install/chameleon/lib           \
      -lchameleon -lchameleon_starpu -lcoreblas        \
      -lstarpu-1.1 -Wl,--no-as-needed -lmkl_intel_lp64 \
      -lmkl_sequential -lmkl_core -lpthread -lm -lrt

Note that an update of your environment variable LD_LIBRARY_PATH (DYLD_LIBRARY_PATH on Mac, LIB on Windows) with the path of the libraries could be required before executing; example:…
13. CHAMELEON User's Guide
Software of the MORSE project: a dense linear algebra software for heterogeneous architectures. Version 0.9.1.
Inria, University of Tennessee, University of Colorado Denver, King Abdullah University of Science and Technology.
Copyright 2014 Inria
Copyright 2014 The University of Tennessee
Copyright 2014 King Abdullah University of Science and Technology

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
- Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
- Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer listed in this license in the documentation and/or other materials provided with the distribution.
- Neither the name of the copyright holders nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

This software is provided by the copyright holders and contributors "as is" and any express or implied warranties, including, but not limited to, the implied warranties of merchantability and fitness for a particular purpose, are disclaimed. In no event shall the copyright owner or contributors be liable for any direct, indirect, incidental, special, exemplary, or consequential dama…
14. E enum transA MORSE enum diag MORSE_Complex64_t alpha MORSE desc t x MORSE desc t B int MORSE ztrtri Tile MORSE enum uplo MORSE enum diag MORSE desc t A int MORSE zunglq Tile MORSE desc t A MORSE desc t T MORSE desc t B int MORSE zungqr Tile MORSE desc t A MORSE desc t T MORSE desc t B int MORSE zunmlq Tile MORSE enum side MORSE enum trans MORSE desc t x MORSE desc t T MORSE desc t B int MORSE zunmqr Tile MORSE enum side MORSE enum trans MORSE desc t x MORSE desc t T MORSE desc t B Jk GRR CRI AK Declarations of computational functions tile layout asynchronous execution x int MORSE_zgelqf_Tile_Async MORSE_desc_t A MORSE desc t st MORSE sequence t sequence MORSE request t request int MORSE gelos Tile Async MORSE desc t x MORSE desc t st MORSE desc t B MORSE sequence t sequence MORSE request t request int MORSE zgels Tile Async MORSE enum trans MORSE desc t x MORSE desc t T MORSE desc t B MORSE sequence t sequence MORSE request t request int MORSE zgemm Tile Async MORSE enum transA MORSE enum transB MORSE Complex64 t alpha MORSE desc t x MORSE desc t xB MORSE Complex64 t beta MORSE desc t C MORSE sequence t sequence MORSE request t request int MORSE zgeqrf Tile Async MORSE desc t A MORSE desc t T MORSE sequence t sequence MORSE request t request Chapter 4 Using CHAMELEON 39 int M
15. LDA MORSE Complex64 t B int LDB int MORSE zgetrf incpiv int M int N MORSE Complex64 t A int LDA MORSE_desc_t xdescL int IPIV int MORSE zgetrf nopiv int M int N MORSE Complex64 t x int LDA int MORSE zgetrs incpiv MORSE enum trans int N int NRHS MORSE Complex64 t A int LDA MORSE desc t xdescL int IPIV MORSE Complex64 t xB int LDB int MORSE zgetrs nopiv MORSE enum trans int N int NRHS MORSE Complex64 t A int LDA MORSE Complex64 t B int LDB tifdef COMPLEX int MORSE zhemm MORSE enum side MORSE enum uplo int M int N MORSE Complex64 t alpha MORSE Complex64 t A int LDA MORSE Complex64 t xB int LDB MORSE Complex64 t beta MORSE Complex64 t C int LDC int MORSE zherk MORSE enum uplo MORSE enum trans int N int K Chapter 4 Using CHAMELEON 33 double alpha MORSE Complex64 t xA int LDA double beta MORSE Complex64 t C int LDC int MORSE zher2k MORSE enum uplo MORSE enum trans int N int K MORSE Complex64 t alpha MORSE Complex64 t A int LDA MORSE Complex64 t xB int LDB double beta MORSE Complex64 t C int LDC tendif int MORSE zlacpy MORSE enum uplo int M int N MORSE Complex64 t xA int LDA MORSE Complex64 t B int LDB double MORSE zlange MORSE enum norm int M int N MORSE Complex64 t A int LDA tifdef COMPLEX double MORSE zlanhe MORSE enum norm MORSE enum uplo int N MORSE Complex64 t A int LDA tendif double MORSE zlansy MO
16. Linear Algebra PACKage is a software library for numerical linear algebra a successor of LINPACK and EISPACK and a predecessor of CHAMELEON LAPACK pro vides routines for solving linear systems of equations linear least square problems eigen value problems and singular value problems Most commercial and academic BLAS packages also provide some LAPACK routines Caution about the compatibility CHAMELEON has been mainly tested with the ref erence LAPACK from NETLIB and the Intel MKL 11 1 from Intel distribution 2013 spl Chapter 2 Installing CHAMELEON 8 2 1 2 4 LAPACKE LAPACKE is a C language interface to LAPACK or CLAPACK It is produced by Intel in coordination with the LAPACK team and is available in source code from Netlib in its original version Netlib LAPACKE and from CHAMELEON website in an extended version LAPACKE for CHAMELEON In addition to implementing the C interface LAPACKE also provides routines which automatically handle workspace allocation making the use of LAPACK much more convenient Caution about the compatibility CHAMELEON has been mainly tested with the ref erence LAPACKE from NETLIB A stand alone version of LAPACKE is required 2 1 2 5 libtmg libtmg is a component of the LAPACK library containing routines for generation of input matrices for testing and timing of LAPACK The testing and timing suites of LAPACK require libtmg but not the library itself Note that the LAPACK library can be built and
17. MORSE Complex64 t beta MORSE desc t C MORSE_sequence_t sequence MORSE request t request Chapter 4 Using CHAMELEON 42 int MORSE zsyr2k Tile Async MORSE enum uplo MORSE enum trans MORSE Complex64 t alpha MORSE desc t x MORSE desc t xB MORSE Complex64 t beta MORSE desc t C MORSE sequence t sequence MORSE request t request int MORSE ztrmm Tile Async MORSE enum side MORSE enum uplo MORSE enum transA MORSE enum diag MORSE Complex64 t alpha MORSE desc t x MORSE_desc_t xB MORSE sequence t sequence MORSE request t request int MORSE ztrsm Tile Async MORSE enum side MORSE enum uplo MORSE enum transA MORSE enum diag MORSE Complex64 t alpha MORSE desc t x MORSE desc t xB MORSE sequence t sequence MORSE request t request int MORSE ztrsmpl Tile Async MORSE desc t A MORSE desc t L int IPIV MORSE_desc_t B MORSE sequence t sequence MORSE request t request int MORSE ztrsmrv Tile Async MORSE enum side MORSE enum uplo MORSE enum transA MORSE enum diag MORSE Complex64 t alpha MORSE desc t x MORSE_desc_t B MORSE sequence t sequence MORSE request t request int MORSE ztrtri Tile Async MORSE enum uplo MORSE enum diag MORSE desc t x MORSE sequence t sequence MORSE request t request int MORSE zunglq Tile Async MORSE desc t A MORSE desc t T MORSE desc t B MORSE sequence t sequence MORSE request t request int MORSE zungqr
18. N to link with StarPU library runtime system DCHAMELEON SCHED QUARK trigger default OFF to link with QUARK library runtime system DCHAMELEON USE CUDA trigger default OFF to link with CUDA runtime implementation paradigm for accelerated codes on GPUs and cuBLAS library optimized BLAS kernels on GPUs can only be used with StarPU DCHAMELEON USE MAGMA trigger default OFF to link with MAGMA library kernels on GPUs higher level than cuBLAS can only be used with StarPU DCHAMELEON USE MPI trigger default OFF to link with MPI library message passing implementation for use of multiple nodes with distributed memory can only be used with StarPU DCHAMELEON USE FXT trigger default OFF to link with FxT library trace execution of kernels on workers can only be used with StarPU DCHAMELEON SIMULATION trigger default OFF to enable simulation mode means CHAMELEON will not really execute tasks see details in section Section 3 4 Use simulation mode with StarPU SimGrid page 17 This option must be used with StarPU compiled with SimGrid allowing to guess the execution time on any architecture This feature should be used to make experiments on the scheduler behaviors and performances not to produce solutions of linear systems DCHAMELEON ENABLE DOCS trigger default ON to control build of the documentation contained in docs sub directory DCHAMELEON ENABLE EXAMPLE trigger default ON to control bu
19. ON you can give the option trace to tell the program to generate trace log files Finally to generate the trace file which can be opened with Vite program you have to use the starpu fxt tool executable of StarPU This tool should be in path to your install starpu bin You can use it to generate the trace file like this e path to your install starpu bin starpu fxt tool i prof filename There is one file per mpi processus prof filename 0 prof filename 1 To generate a trace of mpi programs you can call it like this e path to your install starpu bin starpu fxt tool i prof filenamex The trace file will be named paje trace use o option to specify an output name 3 4 Use simulation mode with StarPU SimGrid Simulation mode can be enabled by setting the cmake option DCHAMELEON SIMULATION ON This mode allows you to simulate execution of algorithms with StarPU compiled with SimGrid To do so we provide some perfmodels in the simucore perfmodels directory of CHAMELEON sources To use these perfmodels please set the following e STARPU HOME environment variable to path to SOURCE DIR simucore perfmodels e STARPU HOSTNAME environment variable to the name of the machine to simulate For example on our platform PlaFRIM with GPUs at Inria Bordeaux STARPU HOSTNAME mirage Note that only POTRF kernels with block sizes of 320 or 960 simple and double precision on mirage machine are available for now Database of models is
20. ORSE zgeqrs Tile Async MORSE desc t A MORSE desc t st MORSE desc t B MORSE_sequence_t sequence MORSE request t request int MORSE zgesv incpiv Tile Async MORSE desc t x MORSE desc t L int IPIV MORSE desc t B MORSE_sequence_t sequence MORSE request t request int MORSE zgesv nopiv Tile Async MORSE desc t A MORSE desc t B MORSE sequence t sequence MORSE request t request int MORSE zgetrf incpiv Tile Async MORSE desc t A MORSE desc t L int IPIV MORSE sequence t sequence MORSE request t request int MORSE zgetrf nopiv Tile Async MORSE desc t x MORSE_sequence_t sequence MORSE request t request int MORSE zgetrs incpiv Tile Async MORSE desc t A MORSE desc t L int IPIV MORSE desc t B MORSE_sequence_t sequence MORSE request t request int MORSE zgetrs nopiv Tile Async MORSE desc t x MORSE desc t B MORSE sequence t sequence MORSE request t request tifdef COMPLEX int MORSE zhemm Tile Async MORSE enum side MORSE enum uplo MORSE Complex64 t alpha MORSE desc t x MORSE desc t xB MORSE Complex64 t beta MORSE desc t C MORSE sequence t sequence MORSE request t request int MORSE zherk Tile Async MORSE enum uplo MORSE enum trans double alpha MORSE desc t x double beta MORSE desc t C MORSE_sequence_t sequence MORSE request t request int MORSE zher2k Tile Async MORSE enum uplo MORSE enum trans MORSE Complex64 t alph
21. RSE enum norm MORSE enum uplo int N MORSE Complex64 t A int LDA double MORSE zlantr MORSE enum norm MORSE enum uplo MORSE enum diag int M int N MORSE Complex64 t A int LDA int MORSE zlaset MORSE enum uplo int M int N MORSE Complex64 t alpha MORSE Complex64 t beta MORSE Complex64 t A int LDA int MORSE zlauum MORSE enum uplo int N MORSE Complex64 t A int LDA tifdef COMPLEX int MORSE zplghe double bump int N MORSE Complex64 t A int LDA unsigned long long int seed tendif int MORSE zplgsy MORSE Complex64 t bump int N MORSE Complex64 t xA int LDA unsigned long long int seed int MORSE zplrnt int M int N MORSE Complex64 t A int LDA unsigned long long int seed int MORSE zposv MORSE enum uplo int N int NRHS MORSE Complex64 t A int LDA MORSE Complex64 t B int LDB Chapter 4 Using CHAMELEON 34 int MORSE zpotrf MORSE enum uplo int N MORSE Complex64 t A int LDA int MORSE zsytrf MORSE enum uplo int N MORSE Complex64 t A int LDA int MORSE zpotri MORSE enum uplo int N MORSE Complex64 t A int LDA int MORSE zpotrs MORSE enum uplo int N int NRHS MORSE Complex64 t xA int LDA MORSE Complex64 t xB int LDB if defined PRECISION c defined PRECISION_z int MORSE zsytrs MORSE enum uplo int N int NRHS MORSE Complex64 t xA int LDA MORSE Complex64 t xB int LDB tendif int MORSE zsymm MORSE enum side MORSE enum uplo int M int N
22. a MORSE desc t x MORSE desc t B double beta MORSE desc t C Chapter 4 Using CHAMELEON 40 MORSE sequence t sequence MORSE request t request tendif int MORSE zlacpy Tile Async MORSE enum uplo MORSE desc t x MORSE desc t B MORSE sequence t sequence MORSE request t request int MORSE zlange Tile Async MORSE enum norm MORSE desc t x double xvalue MORSE sequence t sequence MORSE request t request tifdef COMPLEX int MORSE zlanhe Tile Async MORSE enum norm MORSE enum uplo MORSE desc t A double value MORSE_sequence_t sequence MORSE request t request tendif int MORSE zlansy Tile Async MORSE enum norm MORSE enum uplo MORSE desc t A double value MORSE sequence t sequence MORSE request t request int MORSE zlantr Tile Async MORSE enum norm MORSE enum uplo MORSE enum diag MORSE desc t A double value MORSE sequence t sequence MORSE request t request int MORSE zlaset Tile Async MORSE enum uplo MORSE_Complex64_t alpha MORSE Complex64 t beta MORSE desc t x MORSE sequence t sequence MORSE request t request int MORSE zlauum Tile Async MORSE enum uplo MORSE desc t x MORSE sequence t sequence MORSE request t request tifdef COMPLEX int MORSE zplghe Tile Async double bump MORSE desc t x unsigned long long int seed MORSE sequence t sequence MORSE request t request tendif int MORSE zplgsy Tile Async MORSE Complex 4
23. acing This library provides efficient support for recording traces CHAMELEON can trace kernels execution on the different workers and produce paje files if FxT is enabled FxT can only be used through StarPU and StarPU must be compiled with FxT enabled see how to use this feature here Section 3 3 Use FxT profiling through StarPU page 17 Caution about the compatibility FxT should be compatible with the version of StarPU used 2 2 Build process of CHAMELEON 2 2 1 Setting up a build directory The CHAMELEON build process requires CMake version 2 8 0 or higher and working C and Fortran compilers Compilation and link with CHAMELEON libraries have been tested with gcc gfortran 4 8 1 and icc ifort 14 0 2 On Unix like operating systems it also requires Make The CHAMELEON project can not be configured for an in source build You will get an error message if you try to compile in source Please clean the root of your project by deleting the generated CMakeCache txt file and other CMake generated files mkdir build cd build You can create a build directory from any location you would like It can be a sub directory of the CHAMELEON base source directory or anywhere else 2 2 2 Configuring the project with best efforts cmake path to SOURCE DIR DOPTION1 DOPTION2 Kpath to SOURCE DIR represents the root of CHAMELEON project where stands the main parent CMakeLists txt file Details about options that are useful to give to c
24. …2.1.2.8 hwloc
2.1.2.9 pthread
2.1.3 Optional dependencies
2.1.3.1 OpenMPI
2.1.3.2 Nvidia CUDA Toolkit
2.1.3.3 MAGMA
2.1.3.4 FxT
2.2 Build process of CHAMELEON
2.2.1 Setting up a build directory
2.2.2 Configuring the project with best efforts
2.2.3 Building
2.2.4 Tests
2.2.5 Installing
3 Configuring CHAMELEON
3.1 Compilation configuration
3.1.1 General CMake options
3.1.2 CHAMELEON options
3.2 Dependencies detection
3.3 Use FxT profiling through StarPU
3.4 Use simulation mode with StarPU-SimGrid
4 Using CHAMELEON
4.1 Using CHAMELEON executables
4.2 Linking an external application with CHAMELEON libraries
4.2.1 Static linking in C
4.2.2 Dynamic linking in C
4.2.3 Build a Fortran program with CHAMELEON
4.3 CHAMELEON API
4.3.1 Tutorial LAPACK to CHAMELEON
4.3.1.1 Step0
4.3.1.2 Step1
4.3.1.3 Step2
4.3.1.4 Step3
4.3.1.5 Step4…
25. …atrix size
- number of right-hand sides
- block (tile) size
The problem size is given with the --n and --nrhs options. The tile size is given with the --nb option. These parameters are required to create descriptors. The tile size NB is a key parameter for performance since it defines the granularity of tasks. If NB is too large compared to N, there are few tasks to schedule; if the number of workers is large, this limits parallelism. On the contrary, if NB is too small (i.e. many small tasks), workers may not be correctly fed and the runtime system's operations could represent a substantial overhead. A trade-off has to be found, depending on many parameters: problem size, algorithm (which drives the data dependencies), architecture (number of workers, workers' speed, workers' uniformity, memory bus speed). By default it is set to 128. Do not hesitate to play with this parameter and compare performance on your machine (see the small example after this excerpt).
- inner blocking size: the inner blocking size is given with the --ib option. This parameter is used by kernels (optimized algorithms applied on tiles) to perform subsequent operations with a data block size that fits the cache of the workers. Parameters NB and IB can be given with the MORSE_Set function:

    MORSE_Set(MORSE_TILE_SIZE,        iparam[IPARAM_NB]);
    MORSE_Set(MORSE_INNER_BLOCK_SIZE, iparam[IPARAM_IB]);

4.3.1.7 Step6
This program is a copy of Step5 with some additional parameters to be set for the data distr…
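To get a feeling for the granularity trade-off discussed above, a small back-of-the-envelope helper. The NT^3/6 task-count estimate for a tile Cholesky factorization is an approximation used for illustration, not a figure from this guide.

    #include <stdio.h>

    /* Rough task-count estimate for tile Cholesky:
       NT = number of tile rows/columns, tasks ~ NT^3 / 6. */
    int main(void)
    {
        int N = 10000;                          /* problem size */
        for (int NB = 64; NB <= 1024; NB *= 2) {
            long NT = (N + NB - 1) / NB;        /* number of tiles per dimension */
            printf("NB = %4d  ->  NT = %3ld tiles, ~%ld tasks\n",
                   NB, NT, NT * NT * NT / 6);
        }
        return 0;
    }

Large NB gives few, coarse tasks (little parallelism); small NB gives many tiny tasks (runtime overhead), which is exactly the trade-off described above.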
26. desc t xdescT MORSE Complex64 t B int LDB int MORSE zungqr int M int N int K MORSE Complex64 t A int LDA MORSE desc t xdescT MORSE Complex64 t xB int LDB int MORSE zunmlq MORSE enum side MORSE enum trans int M int N int K MORSE Complex64 t xA int LDA MORSE desc t descT MORSE Complex64 t xB int LDB int MORSE zunmqr MORSE enum side MORSE enum trans int M int N int K MORSE Complex64 t A int LDA MORSE desc t descT MORSE Complex64 t xB int LDB f CK KKK K 21 21 21 K 21 1 21 2 3K 3K AK Declarations of computational functions tile layout L d int MORSE zgelqf Tile MORSE desc t A MORSE desc t T int MORSE zgelqs Tile MORSE desc t A MORSE desc t T MORSE desc t B int MORSE zgels Tile MORSE enum trans MORSE desc t A MORSE desc t T MORSE desc t xB int MORSE zgemm Tile MORSE enum transA MORSE enum transB MORSE Complex64 t alpha MORSE desc t x MORSE desc t xB MORSE Complex64 t beta MORSE desc t C int MORSE zgeqrf Tile MORSE desc t A MORSE desc t sti int MORSE zgeqrs Tile MORSE desc t A MORSE desc t T MORSE desc t B int MORSE zgesv incpiv Tile MORSE desc t A MORSE desc t L int IPIV MORSE desc t B Chapter 4 Using CHAMELEON 36 int MORSE zgesv nopiv Tile MORSE desc t x MORSE desc t B int MORSE zgetrf incpiv Tile MORSE desc t A MORSE desc t L int IPIV int MORSE zgetrf nopiv Tile MORSE desc t A int MORSE zgetrs incp
27. e Destroy MORSE sequence t sequence Wait for the completion of a sequence int MORSE Sequence Wait MORSE sequence t sequence 4 3 2 5 Linear Algebra routines Routines computing linear algebra of the form MORSE name Tile Async name fol lows LAPACK naming scheme see http www netlib org lapack lug node24 html availables DER kok oo ooo oo keck ek Declarations of computational functions LAPACK layout xk int MORSE zgelqf int M int N MORSE Complex64 t A int LDA MORSE desc t descT Chapter 4 Using CHAMELEON 32 int MORSE zgelqs int M int N int NRHS MORSE Complex64 t A int LDA MORSE desc t xdescT MORSE Complex64 t xB int LDB int MORSE zgels MORSE enum trans int M int N int NRHS MORSE Complex64 t A int LDA MORSE desc t descT MORSE Complex64 t B int LDB int MORSE zgemm MORSE enum transA MORSE enum transB int M int N int K MORSE_Complex64_t alpha MORSE Complex64 t A int LDA MORSE Complex64 t xB int LDB MORSE Complex64 t beta MORSE Complex64 t C int LDC int MORSE zgeqrf int M int N MORSE Complex64 t A int LDA MORSE desc t descT int MORSE zgeqrs int M int N int NRHS MORSE Complex64 t A int LDA MORSE desc t xdescT MORSE Complex64 t B int LDB int MORSE zgesv incpiv int N int NRHS MORSE Complex64 t A int LDA MORSE desc t xdescL int IPIV MORSE Complex64 t B int LDB int MORSE zgesv nopiv int N int NRHS MORSE Complex64 t A int
28. …MORSE_Desc_Create_User(&descA, mat, MorseRealDouble,
                             NB, NB, NB*NB, N, N, 0, 0, N, N, 1, 1,
                             user_getaddr_arrayofpointers,
                             user_getblkldd_arrayofpointers,
                             user_getrankof_zero);

The first arguments are the same as for the MORSE_Desc_Create routine. The following arguments allow you to give pointers to functions that manage the access to tiles from the structure given as second argument. Here, for example, mat is an array containing the addresses of the tiles (see the function allocate_tile_matrix defined in step3.h). The three functions you have to define for Desc_Create_User are:
- a function that returns the address of tile A(m, n), m and n standing for the indexes of the tile in the global matrix. Let us consider a 4x4 matrix with tile size 2x2: the matrix contains four tiles of indexes A(m=0, n=0), A(m=0, n=1), A(m=1, n=0), A(m=1, n=1);
- a function that returns the leading dimension of tile A(m, *);
- a function that returns the MPI rank of tile A(m, n).
Examples for these functions are visible in step3.h. Note that the way we define these functions is related to the tile matrix format and to the data distribution considered. This example should not be used with MPI since all tiles are assigned to process 0, which means a large amount of data would potentially be transferred between nodes.

4.3.1.5 Step4
This program is a copy of Step2, but instead of using the tile interface, it uses the tile async interface. The goal is to exhibit the…
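Returning to the three Desc_Create_User callbacks described in the excerpt above, a hedged sketch for a matrix stored as an array of pointers to contiguous NB x NB tiles. The callback prototypes are assumptions based on the descriptor structure, and these helpers are illustrative, not the ones from step3.h.

    typedef struct {
        double **tiles;   /* tiles[m + n*mt]: address of tile (m, n) */
        int      mt, nt;  /* number of tile rows / columns           */
        int      nb;      /* tile size                               */
    } tile_matrix_t;

    /* address of tile A(m, n); desc->mat is assumed to hold the user pointer */
    static void *my_getaddr(const MORSE_desc_t *desc, int m, int n)
    {
        tile_matrix_t *mat = (tile_matrix_t *)desc->mat;
        return mat->tiles[m + n * mat->mt];
    }

    /* leading dimension of the tiles in row m (square tiles here) */
    static int my_getblkldd(const MORSE_desc_t *desc, int m)
    {
        tile_matrix_t *mat = (tile_matrix_t *)desc->mat;
        (void)m;
        return mat->nb;
    }

    /* MPI rank owning tile A(m, n): everything on process 0 here */
    static int my_getrankof(const MORSE_desc_t *desc, int m, int n)
    {
        (void)desc; (void)m; (void)n;
        return 0;
    }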
29. …f values in a tile (NB*NB), and the number of rows (N) and columns (N) in the entire matrix.
- The seventh to ninth arguments stand respectively for the beginning row (0) and column (0) indexes of the submatrix and the number of rows (N) and columns (N) in the submatrix. These arguments are specific and used in precise cases; if you do not consider submatrices, just use 0, 0, NROWS, NCOLS.
- The two last arguments are the parameters of the 2-D block-cyclic distribution grid (see ScaLAPACK). To be able to use another data distribution over the nodes, the MORSE_Desc_Create_User function should be used.

4.3.1.4 Step3
This program makes use of the same interface as Step2 (tile interface) but does not allocate LAPACK matrices anymore, so that no copy between the LAPACK matrix layout and the tile matrix layout is necessary to call MORSE routines. To generate random right-hand sides you can use:

    /* Allocate memory and initialize descriptor B */
    MORSE_Desc_Create(&descB, NULL, MorseRealDouble,
                      NB, NB, NB*NB, N, NRHS, 0, 0, N, NRHS, 1, 1);
    /* generate RHS with random values */
    MORSE_dplrnt_Tile(descB, 5673);

The other important point is that it is possible to create a descriptor (the necessary structure to call MORSE efficiently) by giving your own pointer to tiles, if your matrix is not organized as a 1-D array in column-major order. This can be achieved with the MORSE_Desc_Create_User routine. Here is an example: MORSE_Desc_Creat…
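Putting the Step3 pieces above together, a hedged sketch that generates a random symmetric positive definite matrix and random right-hand sides entirely in tile layout, then solves the system. The bump value N, the seeds and the use of MorseUpper are illustrative; the guide's step3.c may differ.

    MORSE_desc_t *descA = NULL, *descB = NULL;

    MORSE_Desc_Create(&descA, NULL, MorseRealDouble,
                      NB, NB, NB*NB, N, N,    0, 0, N, N,    1, 1);
    MORSE_Desc_Create(&descB, NULL, MorseRealDouble,
                      NB, NB, NB*NB, N, NRHS, 0, 0, N, NRHS, 1, 1);

    /* random symmetric matrix; the "bump" N added on the diagonal
       makes it diagonally dominant, hence positive definite */
    MORSE_dplgsy_Tile((double)N, descA, 51);
    /* random right-hand sides */
    MORSE_dplrnt_Tile(descB, 5673);

    /* Cholesky factorization and solve, tile interface */
    MORSE_dpotrf_Tile(MorseUpper, descA);
    MORSE_dpotrs_Tile(MorseUpper, descA, descB);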
30. …ges (including, but not limited to, procurement of substitute goods or services; loss of use, data, or profits; or business interruption) however caused and on any theory of liability, whether in contract, strict liability, or tort (including negligence or otherwise), arising in any way out of the use of this software, even if advised of the possibility of such damage.

Table of Contents
1 Introduction to CHAMELEON
1.1 MORSE project
1.1.1 MORSE Objectives
1.1.2 Research fields
1.1.2.1 Fine interaction between linear algebra and runtime systems
1.1.2.2 Runtime systems
1.1.2.3 Linear algebra
1.1.3 Research papers
1.2 CHAMELEON
1.2.1 CHAMELEON software
1.2.2 PLASMA's design principles
1.2.2.1 Tile Algorithms
1.2.2.2 Tile Data Layout
1.2.2.3 Dynamic Task Scheduling
2 Installing CHAMELEON
2.1 Downloading CHAMELEON
2.1.1 Getting Sources
2.1.2 Required dependencies
2.1.2.1 a BLAS implementation
2.1.2.2 CBLAS
2.1.2.3 a LAPACK implementation
2.1.2.4 LAPACKE
2.1.2.5 libtmg
2.1.2.6 QUARK
2.1.2.7 StarPU
2.1.2.8 hwloc…
31. …gorithm to each core and following a cycle: transition to a task, wait for its dependencies, execute it, update the overall progress. Tasks are identified by tuples and task transitions are done through locally evaluated formulas. Progress information can be centralized, replicated or distributed (currently centralized). A schematic sketch of this per-core cycle is given after this excerpt.

[Figure: a trace of the tile QR factorization executing on eight cores without any global synchronization points (kernel names for real arithmetic in single precision: SGEQRT, SLARFB, ...); courtesy of the PLASMA team.]

2 Installing CHAMELEON
CHAMELEON can be built and installed by the standard means of CMake (http://www.cmake.org). General information about CMake, as well as installation binaries and CMake source code, are available from http://www.cmake.org/cmake/resources/software.html. The following chapter is intended to briefly remind how these tools can be used to install CHAMELEON.

2.1 Downloading CHAMELEON
2.1.1 Getting Sources
The latest official release tarballs of CHAMELEON sources are available for download from chame…
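A schematic sketch of a statically scheduled worker loop, to make the per-core cycle concrete. This is purely illustrative C pseudocode, not PLASMA's or CHAMELEON's actual implementation; progress[], path_next() and execute() are invented names.

    typedef struct { int id; int deps[4]; int ndep; } task_t;

    extern int  progress[];                      /* shared progress counters  */
    extern int  path_next(int core, task_t *t);  /* next task on this path    */
    extern void execute(task_t *t);              /* run the kernel            */

    void worker_loop(int my_core)
    {
        task_t t;
        /* each core follows its statically assigned path through the DAG */
        while (path_next(my_core, &t)) {
            for (int i = 0; i < t.ndep; i++)     /* dependency checks done    */
                while (progress[t.deps[i]] == 0) /* at runtime                */
                    ;                            /* spin (real code would yield) */
            execute(&t);
            progress[t.id] = 1;                  /* publish completion        */
        }
    }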
32. gui installed on a Linux system 3 invoque cmake gui command and fill information about the location of the sources and where to build the project then you have access to options through a user friendly Qt interface required cmake qt gui installed on a Linux system Example of configuration using the command line cmake chameleon DCMAKE BUILD TYPE Debug N DCMAKE INSTALL PREFIX install A DCHAMELEON USE CUDA ON N DCHAMELEON USE MAGMA ON DCHAMELEON_USE_MPI 0N DBLA VENDOR Intel10 641p N DSTARPU DIR install starpu 1 1 DCHAMELEON USE FXT ON You can get the full list of options with L A H options of cmake command cmake LH path to source directory 3 1 1 General CMake options DCMAKE INSTALL PREFIX path default path usr local Install directory used by make install where some headers and libraries will be copied Permissions have to be granted to write onto path during make install step DCMAKE BUILD TYPE var default Release Define the build type and the compiler optimization level The possible values for var are empty Debug Release Chapter 3 Configuring CHAMELEON 14 RelWithDebInfo MinSizeRel DBUILD SHARED LIBS trigger default OFF Indicate wether or not CMake has to build CHAMELEON static OFF or shared ON libraries 3 1 2 CHAMELEON options List of CHAMELEON options that can be enabled disabled value ON or OFF DCHAMELEON_SCHED_STARPU trigger default O
33. …ibution. To use this program properly, MORSE must use the StarPU runtime system and the MPI option must be activated at configure time. The data distribution used here is 2-D block-cyclic (see for example ScaLAPACK for an explanation). The user can enter the parameters of the distribution grid at execution with the --p option. Example using OpenMPI on four nodes with one process per node:

    mpirun -np 4 ./step6 --n=10000 --nb=320 --ib=64 \
           --threads=8 --gpus=2 --p=2

In this program we use the tile data layout from PLASMA, so that the call

    MORSE_Desc_Create_User(&descA, NULL, MorseRealDouble,
                           NB, NB, NB*NB, N, N, 0, 0, N, N,
                           GRID_P, GRID_Q,
                           morse_getaddr_ccrb,
                           morse_getblkldd_ccrb,
                           morse_getrankof_2d);

is equivalent to the following call:

    MORSE_Desc_Create(&descA, NULL, MorseRealDouble,
                      NB, NB, NB*NB, N, N, 0, 0, N, N,
                      GRID_P, GRID_Q);

the functions morse_getaddr_ccrb, morse_getblkldd_ccrb and morse_getrankof_2d being used in Desc_Create. It is interesting to notice that the code is almost the same as Step5. The only additional information to give is the way tiles are distributed, through the third function given to MORSE_Desc_Create_User. Here, because we have made experiments only with a 2-D block-cyclic distribution, we have parameters P and Q in the interface of Desc_Create, but they make sense only for a 2-D block-cyclic distribution and then using…
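The third callback is the one that encodes the distribution. A hedged sketch of what a 2-D block-cyclic rank function computes; it is illustrative only, the actual morse_getrankof_2d in CHAMELEON may differ in details, and the desc->p / desc->q fields are assumed to hold the grid dimensions.

    /* Owner of tile (m, n) on a P x Q process grid, 2-D block-cyclic.
       Ranks are assumed to be laid out row-major on the grid. */
    static int getrankof_2d(const MORSE_desc_t *desc, int m, int n)
    {
        int p = m % desc->p;   /* grid row owning tile row m       */
        int q = n % desc->q;   /* grid column owning tile column n */
        return p * desc->q + q;
    }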
34. ild of the examples executables API usage contained in example sub directory DCHAMELEON ENABLE TESTING trigger default ON to control build of testing executables numerical check contained in testing sub directory DCHAMELEON ENABLE TIMING trigger default ON to control build of timing executables performances check contained in timing sub directory Chapter 3 Configuring CHAMELEON 15 DCHAMELEON PREC S trigger default ON to enable the support of simple arithmetic precision float in C DCHAMELEON PREC D trigger default ON to enable the support of double arithmetic precision double in C DCHAMELEON PREC C trigger default ON to enable the support of complex arithmetic precision complex in C DCHAMELEON PREC Z trigger default ON to enable the support of double complex arithmetic precision double complex in C DBLAS VERBOSE trigger default OFF to make BLAS library discovery verbose DLAPACK VERBOSE trigger default OFF to make LAPACK library discovery verbose automatically enabled if BLAS_ VERBOSE ON List of CHAMELEON options that needs a specific value DBLA VENDOR var default empty The possible values for var are empty all Intel10 641p Inteli0 641p seq ACML Apple Generic to force CMake to find a specific BLAS library see the full list of BLA VENDOR in FindBLAS cmake in cmake modules morse find By default BLA VENDOR is empty so that CMake tries to detect al
35. iv Tile MORSE desc t A MORSE desc t L int IPIV MORSE desc t xB int MORSE zgetrs nopiv Tile MORSE desc t A MORSE desc t B tifdef COMPLEX int MORSE zhemm Tile MORSE enum side MORSE enum uplo MORSE Complex64 t alpha MORSE desc t x MORSE desc t xB MORSE Complex64 t beta MORSE desc t C int MORSE zherk Tile MORSE enum uplo MORSE enum trans double alpha MORSE desc t x double beta MORSE desc t C int MORSE zher2k Tile MORSE enum uplo MORSE enum trans MORSE Complex64 t alpha MORSE desc t x MORSE desc t xB double beta MORSE desc t xC tendif int MORSE zlacpy Tile MORSE enum uplo MORSE desc t x MORSE desc t B double MORSE zlange Tile MORSE enum norm MORSE desc t A tifdef COMPLEX double MORSE zlanhe Tile MORSE enum norm MORSE enum uplo MORSE desc t A tendif double MORSE zlansy Tile MORSE enum norm MORSE enum uplo MORSE desc t A double MORSE zlantr Tile MORSE enum norm MORSE enum uplo MORSE enum diag MORSE desc t A int MORSE zlaset Tile MORSE enum uplo MORSE Complex64 t alpha MORSE Complex64 t beta MORSE desc t A int MORSE zlauum Tile MORSE enum uplo MORSE desc t x tifdef COMPLEX int MORSE zplghe Tile double bump MORSE desc t x unsigned long long int seed Chapter 4 Using CHAMELEON tendif int MORSE zplgsy Tile MORSE Complex64 t bump MORSE desc t x unsigned long long int seed int MORSE zplrnt Tile MORSE desc t A unsigned long
36. l possible BLAS vendor with a preference for Intel MKL List of CHAMELEON options which requires to give a path DLIBNAME DIR path default empty root directory of the LIBNAME library installation DLIBNAME INCDIR path default empty directory of the LIBNAME library headers installation DLIBNAME LIBDIR path default empty directory of the LIBNAME libraries so a dylib etc installation LIBNAME can be one of the following BLAS CBLAS FXT HWLOC LAPACK LAPACKE MAGMA QUARK STARPU TMG See paragraph about Section 3 2 Dependencies detection page 16 for details Chapter 3 Configuring CHAMELEON 16 Libraries detected with an official CMake module see module files in CMAKE_ ROOT Modules e CUDA e MPI e Threads Libraries detected with CHAMELEON cmake modules see module files in cmake_ modules morse find directory of CHAMELEON sources e BLAS e CBLAS e FXT e HWLOC e LAPACK e LAPACKE e MAGMA e QUARK e STARPU e TMG 3 2 Dependencies detection You have different choices to detect dependencies on your system either by setting some environment variables containing paths to the libs and headers or by specifying them directly at cmake configure Different cases 1 detection of dependencies through environment variables e LD LIBRARY PATH environment variable should contain the list of paths where to find the libraries export LD LIBRARY PATH LD LIBRARY PATH path to your
37. …l to MORSE routines, MORSE_Init has to be invoked to initialize MORSE and the runtime system. Example:

    MORSE_Init(NCPU, NGPU);

After all MORSE calls have been done, a call to MORSE_Finalize is required to free some data and finalize the runtime and/or MPI:

    MORSE_Finalize();

We use MORSE routines with the LAPACK interface, which means the routines accept the same matrix format as LAPACK (1-D array, column-major). Note that we copy the matrix to get it into our own tile structures (see details about this format in Section 1.2.2.2 Tile Data Layout). This means you can get an overhead coming from copies.

4.3.1.3 Step2
This program is a copy of Step1, but instead of using the LAPACK interface, which leads to copying LAPACK matrices inside MORSE routines, we use the tile interface. We will still use the standard matrix format, but we will see how to give this matrix to create a MORSE descriptor, a structure wrapping the data on which we want to apply sequential task-based algorithms. The solving code becomes:

    /* Factorization */
    MORSE_dpotrf_Tile(UPLO, descA);
    /* Solve */
    MORSE_dpotrs_Tile(UPLO, descA, descX);

To use the tile interface, a specific structure MORSE_desc_t must be created. This can be achieved in different ways:
1. Use the existing function MORSE_Desc_Create: this means the matrix data are considered contiguous in memory, as is assumed in PLASMA (Section 1.2.2.2 Tile Data Lay…
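A minimal hedged skeleton tying together MORSE_Init, the LAPACK interface and MORSE_Finalize, in the spirit of the Step1 program referred to above. The header name morse.h, the NCPU/NGPU values and the seeds are illustrative; the prototypes follow the listings in Section 4.3.2.5, and error handling is omitted.

    #include <stdlib.h>
    #include <morse.h>   /* assumed public header name */

    int main(void)
    {
        int N = 1000, NRHS = 1, NCPU = 4, NGPU = 0;
        double *A = malloc((size_t)N * N    * sizeof(double));
        double *B = malloc((size_t)N * NRHS * sizeof(double));

        MORSE_Init(NCPU, NGPU);                 /* start MORSE and the runtime  */

        MORSE_dplgsy((double)N, N, A, N, 51);   /* random matrix, diagonal bump N
                                                   makes it positive definite   */
        MORSE_dplrnt(N, NRHS, B, N, 5673);      /* random right-hand sides      */

        /* LAPACK interface: column-major arrays, copied to tiles internally */
        MORSE_dpotrf(MorseUpper, N, A, N);
        MORSE_dpotrs(MorseUpper, N, NRHS, A, N, B, N);

        MORSE_Finalize();                       /* free runtime resources       */
        free(A); free(B);
        return 0;
    }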
38. …leon-0.9.1.

2.1.2 Required dependencies
2.1.2.1 a BLAS implementation
BLAS (Basic Linear Algebra Subprograms) is a de facto standard for basic linear algebra operations such as vector and matrix multiplication. A FORTRAN implementation of BLAS is available from Netlib. A C implementation of BLAS is also included in GSL (GNU Scientific Library). Both of these are reference implementations of BLAS; they are not optimized for modern processor architectures and provide an order of magnitude lower performance than optimized implementations. Highly optimized implementations of BLAS are available from many hardware vendors, such as Intel MKL and AMD ACML. Fast implementations are also available as academic packages, such as ATLAS and GotoBLAS. The standard interface to BLAS is the FORTRAN interface. Caution about the compatibility: CHAMELEON has been mainly tested with the reference BLAS from NETLIB and the Intel MKL 11.1 from Intel distribution 2013 sp1.

2.1.2.2 CBLAS
CBLAS is a C language interface to BLAS. Most commercial and academic implementations of BLAS also provide CBLAS. Netlib provides a reference implementation of CBLAS on top of FORTRAN BLAS (Netlib CBLAS). Since GSL is implemented in C, it naturally provides CBLAS. Caution about the compatibility: CHAMELEON has been mainly tested with the reference CBLAS from NETLIB and the Intel MKL 11.1 from Intel distribution 2013 sp1.

2.1.2.3 a LAPACK implementation
LAPACK…
39. libs e INCLUDE environment variable should contain the list of paths where to find the header files of libraries export INCLUDE INCLUDE path to your headers 2 detection with user s given paths e you can specify the path at cmake configure by invoking cmake path to SOURCE DIR DLIBNAME DIR path to your lib where LIB stands for the name of the lib to look for example cmake path to SOURCE DIR DSTARPU_DIR path to starpudir DCBLAS_DIR e it is also possible to specify headers and library directories separately example Chapter 3 Configuring CHAMELEON 17 cmake path to SOURCE DIR N DSTARPU INCDIR path to libstarpu include starpu 1 1 DSTARPU LIBDIR path to libstarpu lib e Note BLAS and LAPACK detection can be tedious so that we provide a verbose mode Use DBLAS_VERBOSE ON or DLAPACK VERBOSE ON to enable it 3 3 Use FxT profiling through StarPU StarPU can generate its own trace log files by compiling it with the with fxt option at the configure step you can have to specify the directory where you installed FxT by giving with fxt instead of with fxt alone By doing so traces are generated after each execution of a program which uses StarPU in the directory pointed by the STARPU_ FXT_PREFIX environment variable Example export STARPU FXT PREFIX home yourname fxt files When executing a timing CHAMELEON program if it has been enabled StarPU compiled with FxT and DCHAMELEON USE FXT
40. long int seed int MORSE zposv Tile MORSE enum uplo MORSE desc t x MORSE desc t B int MORSE zpotrf Tile MORSE enum uplo MORSE desc t x int MORSE zsytrf Tile MORSE enum uplo MORSE desc t x int MORSE zpotri Tile MORSE enum uplo MORSE desc t x int MORSE zpotrs Tile MORSE enum uplo MORSE desc t x MORSE desc t B if defined PRECISION c defined PRECISION z int MORSE zsytrs Tile MORSE enum uplo MORSE desc t x MORSE desc t B tendif int MORSE zsymm Tile MORSE enum side MORSE enum uplo MORSE Complex64 t alpha MORSE desc t x MORSE desc t B MORSE Complex64 t beta MORSE desc t C int MORSE zsyrk Tile MORSE enum uplo MORSE enum trans MORSE Complex64 t alpha MORSE desc t x MORSE Complex64 t beta MORSE desc t C int MORSE zsyr2k Tile MORSE enum uplo MORSE enum trans MORSE_Complex64_t alpha MORSE desc t x MORSE desc t B MORSE Complex64 t beta MORSE desc t C int MORSE ztrmm Tile MORSE enum side MORSE enum uplo MORSE enum transA MORSE enum diag MORSE Complex64 t alpha MORSE desc t x MORSE desc t B int MORSE ztrsm Tile MORSE enum side MORSE enum uplo MORSE enum transA MORSE enum diag MORSE Complex64 t alpha MORSE desc t x MORSE desc t B int MORSE ztrsmpl Tile MORSE desc t x MORSE desc t L int xIPIV MORSE desc t B Chapter 4 Using CHAMELEON 38 int MORSE ztrsmrv Tile MORSE enum side MORSE enum uplo MORS
41. lt 500 5000 500 m X dimension M of the matrices default N k X dimension K of the matrices default 1 useful for GEMM algorithm k is the shared dimension and must be defined 21 to consider matrices and not vectors nrhs X number of right hand size default 1 nb X block tile size default 128 ib X inner blocking IB size default 32 niter X number of iterations performed for each test default 1 rhblk X if X gt 0 enable Householder mode for QR and LQ factorization X is the size of each subdomain default 0 no check check result default nocheck no profile print profiling informations default noprofile no trace enable disable trace generation default notrace no dag enable disable DAG generation default nodag no inv check on inverse default noinv nocpu all GPU kernels are exclusively executed on GPUs default 0 List of timing algorithms available LANGE norms of matrices GEMM general matrix matrix multiply TRSM triangular solve POTRF Cholesky factorization with a symmetric positive definite matrix POSV solve linear systems with symmetric positive definite matrix GETRF NOPIV LU factorization of a general matrix using the tile LU algorithm without row pivoting GESV NOPIV solve linear system for a general matrix using the tile LU algorithm without row pivoting Chapter 4 Using CHAMELEON 21 e GETRF_INCPIV LU factorizatio
42. make path to SOURCE DIR are given in Section 3 1 Compilation configuration page 13 2 2 3 Building make j ncores do not hesitate to use j ncores option to speedup the compilation 2 2 4 Tests In order to make sure that CHAMELEON is working properly on the system it is also possible to run a test suite make check Or ctest Chapter 2 Installing CHAMELEON 11 2 2 5 Installing In order to install CHAMELEON at the location that was specified during configuration make install do not forget to specify the install directory with DCMAKE INSTALL PREFIX at cmake configure cmake path to SOURCE DIR DCMAKE INSTALL PREFIX path to INSTALL DIR Note that the install process is optional You are free to use CHAMELEON binaries compiled in the build directory Chapter 3 Configuring CHAMELEON 13 3 Configuring CHAMELEON 3 1 Compilation configuration The following arguments can be given to the cmake path to source directory script In this chapter the following convention is used e path is a path in your filesystem e var is a string and the correct value or an example will be given e trigger is an CMake option and the correct value is ON or OFF Using CMake there are several ways to give options 1 directly as CMake command line arguments 2 invoque cmake path to source directory once and then use ccmake path to source directory to edit options through a minimalist gui required cmake curses
43. …ments, the scheduler is centralized and becomes a bottleneck as soon as too many cores are involved. It is therefore required to distribute the scheduling decision, or to compute a data distribution that imposes the mapping of tasks, using for instance the so-called owner-compute rule. Expected advances: we will design runtime systems that enable an efficient and scalable use of thousands of distributed multicore nodes enhanced with accelerators.

1.1.2.3 Linear algebra
Because of its central position in HPC and of the well-understood structure of its algorithms, dense linear algebra has often pioneered the new challenges that HPC had to face. Again, dense linear algebra has been in the vanguard of the new era of petascale computing with the design of new algorithms that can efficiently run on a multicore node with GPU accelerators. These algorithms are called communication-avoiding, since they have been redesigned to limit the amount of communication between processing units and between the different levels of the memory hierarchy. They are expressed through Directed Acyclic Graphs (DAG) of fine-grained tasks that are dynamically scheduled. Expected advances: first, we plan to investigate the impact of these principles in the case of sparse applications, whose algorithms are slightly more complicated but often rely on dense kernels. Furthermore, both in the dense and sparse cases, the scalability on thousands of nodes is still limited; new numerical…
44. more memory for auxiliary data. Also, the tile LU factorization applies a different pivoting pattern and, as a result, is less numerically stable than classic LU with full pivoting. Numerical stability is not an issue in the case of the tile QR, which relies on orthogonal transformations (Householder reflections) that are numerically stable.

[Figure: schematic illustration of the tile LU factorization, with kernel names for real arithmetic in double precision (DGESSM, DTSTRF, DSSSSM); courtesy of the PLASMA team.]

1.2.2.2 Tile Data Layout

Tile layout is based on the idea of storing the matrix by square tiles of relatively small size, such that each tile occupies a continuous memory region. This way a tile can be loaded to the cache memory efficiently and the risk of evicting it from the cache memory before it is completely processed is minimized. Of the three types of cache misses (compulsory, capacity and conflict), the use of tile layout minimizes the number of conflict misses, since a continuous region of memory will completely fill out a set-associative cache memory before an eviction can happen. Also, from the standpoint of multithreaded execution, the probability of false sharing is minimized. It can only affect the cache lines containing the beginning and the ending of a tile.
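To make the idea of contiguous tiles concrete, the following minimal C sketch (not CHAMELEON code; the function names and the column-major-of-tiles convention are assumptions based on the flat layout described in this section) computes the starting address of a tile and of an element within it, for a matrix stored as an array of NB x NB tiles laid out one after another:

    #include <stddef.h>

    /* Illustration only: flat tile layout, tiles stored contiguously in
     * column-major order of tiles, elements column-major within each tile.
     * mt is the number of tile rows of the matrix. */
    static double *tile_address(double *A, int NB, int mt, int i, int j)
    {
        /* tile (i, j) starts after (j*mt + i) complete tiles of NB*NB elements */
        return A + (size_t)(j * mt + i) * NB * NB;
    }

    /* element (ii, jj) inside tile (i, j) */
    static double tile_element(double *A, int NB, int mt,
                               int i, int j, int ii, int jj)
    {
        return tile_address(A, NB, mt, i, j)[jj * NB + ii];
    }

Because each tile is one contiguous block, loading it into a cache level or moving it with a single DMA transfer is straightforward, which is exactly the property exploited by the layout described above.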
45. computations on GPUs. The cuBLAS library is an implementation of BLAS (Basic Linear Algebra Subprograms) on top of the Nvidia CUDA runtime. cuBLAS is normally distributed with the Nvidia CUDA Toolkit. CUDA/cuBLAS can be enabled in CHAMELEON only if the runtime system chosen is StarPU (default). To use CUDA through StarPU, it is necessary to compile StarPU with CUDA enabled.

Caution about the compatibility: CHAMELEON has been mainly tested with CUDA releases from versions 4 to 6. The MAGMA library must be compatible with CUDA.

2.1.3.3 MAGMA

The MAGMA project aims to develop a dense linear algebra library similar to LAPACK but for heterogeneous/hybrid architectures, starting with current Multicore+GPU systems. CHAMELEON can use a set of high-level MAGMA routines to accelerate computations on GPUs. To fully benefit from GPUs, the user should enable MAGMA in addition to CUDA/cuBLAS.

Caution about the compatibility: CHAMELEON has been mainly tested with MAGMA releases from versions 1.4 to 1.6. The MAGMA library must be compatible with CUDA. The MAGMA library should be built with sequential versions of BLAS/LAPACK. We should not get MAGMA link flags embarking multithreaded BLAS/LAPACK because it could affect performances (take care about a MAGMA link flag such as -lmkl_intel_thread, for example, that we could inherit from the pkg-config file magma.pc).

2.1.3.4 FxT

FxT stands for both FKT (Fast Kernel Tracing) and FUT (Fast User Tracing).
46. - GESV_INCPIV: solve linear systems for a general matrix using the tile LU algorithm with partial tile pivoting with row interchanges
- GEQRF: QR factorization of a general matrix
- GELS: solves overdetermined or underdetermined linear systems involving a general matrix using the QR or the LQ factorization

4.2 Linking an external application with CHAMELEON libraries

Compilation and link with CHAMELEON libraries have been tested with gcc/gfortran 4.8.1 and icc/ifort 14.0.2.

4.2.1 Static linking in C

Let's imagine you have a file main.c that you want to link with CHAMELEON static libraries. Let's consider /home/yourname/install/chameleon to be the install directory of CHAMELEON, containing the sub-directories include/ and lib/. Here could be your compilation command with the gcc compiler:

    gcc -I/home/yourname/install/chameleon/include -o main.o -c main.c

Now if you want to link your application with CHAMELEON static libraries, you could do:

    gcc main.o -o main                                              \
        /home/yourname/install/chameleon/lib/libchameleon.a         \
        /home/yourname/install/chameleon/lib/libchameleon_starpu.a  \
        /home/yourname/install/chameleon/lib/libcoreblas.a          \
        -lstarpu-1.1 -Wl,--no-as-needed -lmkl_intel_lp64            \
        -lmkl_sequential -lmkl_core -lpthread -lm -lrt

As you can see in this example, we also link with some dynamic libraries: starpu-1.1 and the Intel MKL libraries for BLAS/LAPACK.
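For illustration, here is a minimal sketch of what such a main.c could contain, based on the MORSE routines described later in this chapter (MORSE_Init, MORSE_dpotrf, MORSE_dpotrs, MORSE_Finalize). The header name morse.h, the MorseUpper constant and the worker counts are assumptions, and the matrix initialization is left out, so treat this as a skeleton rather than a complete program:

    #include <stdlib.h>
    #include <morse.h>   /* assumed name of the CHAMELEON/MORSE header */

    int main(void)
    {
        int N = 1000, NRHS = 1;
        double *A = malloc((size_t)N * N    * sizeof(double));
        double *X = malloc((size_t)N * NRHS * sizeof(double));

        /* ... fill A with a symmetric positive definite matrix, X with B ... */

        MORSE_Init(4, 0);                               /* 4 CPU workers, no GPU (arbitrary) */
        MORSE_dpotrf(MorseUpper, N, A, N);              /* Cholesky factorization */
        MORSE_dpotrs(MorseUpper, N, NRHS, A, N, X, N);  /* triangular solves */
        MORSE_Finalize();

        free(A);
        free(X);
        return 0;
    }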
47. In standard cache-based architectures, tiles continuously laid out in memory maximize the profit from automatic prefetching. Tile layout is also beneficial in situations involving the use of accelerators, where explicit communication of tiles through DMA transfers is required, such as moving tiles between the system memory and the local store in Cell B.E., or moving tiles between the host memory and the device memory in GPUs. In most circumstances, tile layout also minimizes the number of TLB misses and conflicts to memory banks or partitions. With the standard column-major layout, access to each column of a tile is much more likely to cause a conflict miss, a false sharing miss, a TLB miss, or a bank or partition conflict. The use of the standard layout for dense matrix operations is a performance minefield: although occasionally one can pass through it unscathed, the risk of hitting a spot deadly to performance is very high.

Another property of the layout utilized in PLASMA is that it is flat, meaning that it does not involve a level of indirection. Each tile stores a small square submatrix of the main matrix in a column-major layout. In turn, the main matrix is an arrangement of tiles immediately following one another in a column-major layout. The offset of each tile can be calculated through address arithmetic and does not involve pointer indirection. Alternatively, a matrix could be
48. synchronization, so that some tasks of dpotrf and dpotrs can be concurrently executed, which could increase performances. The async interface is very similar to the tile one. It is only necessary to give two new objects, MORSE_sequence_t and MORSE_request_t, used to handle asynchronous function calls.

[Figure: POTRI (POTRF, TRTRI, LAUUM) algorithm with and without synchronization barriers; courtesy of the PLASMA team.]

4.3.1.6 Step5

Step5 shows how to set some important parameters. This program is a copy of Step4 but some additional parameters are given by the user. The parameters that can be set are:
- the number of threads,
- the number of GPUs.

The number of workers can be given as arguments to the executable with the --threads and --gpus options. It is important to notice that we assign one thread per GPU to optimize data transfers between main memory and device memory. The number of workers of each type, CPU and CUDA, must be given at MORSE_Init:

    if ( iparam[IPARAM_THRDNBR] == -1 ) {
        get_thread_count( &iparam[IPARAM_THRDNBR] );
        /* reserve one thread per cuda device to optimize memory transfers */
        iparam[IPARAM_THRDNBR] -= iparam[IPARAM_NCUDAS];
    }
    NCPU = iparam[IPARAM_THRDNBR];
    NGPU = iparam[IPARAM_NCUDAS];

    /* initialize MORSE with main parameters */
    MORSE_Init( NCPU, NGPU );
49. numbers if you do not have an infinite amount of RAM). As for every step, the correctness of the solution is checked by calculating a norm of the residual Ax - B (relative to the norms of A, x and B). The time spent in factorization + solve is recorded and, because we know exactly the number of operations of these algorithms, we deduce the number of operations that have been processed per second (in GFlop/s). The important part of the code that solves the problem is:

    /* Cholesky factorization:
     * A is replaced by its factorization L or L^T depending on uplo */
    LAPACKE_dpotrf( LAPACK_COL_MAJOR, 'U', N, A, N );

    /* Solve:
     * B is stored in X on entry, X contains the result on exit.
     * Forward ... */
    cblas_dtrsm( CblasColMajor, CblasLeft, CblasUpper, CblasConjTrans,
                 CblasNonUnit, N, NRHS, 1.0, A, N, X, N );
    /* ... and back substitution */
    cblas_dtrsm( CblasColMajor, CblasLeft, CblasUpper, CblasNoTrans,
                 CblasNonUnit, N, NRHS, 1.0, A, N, X, N );

4.3.1.2 Step1

It introduces the simplest CHAMELEON interface, which is equivalent to CBLAS/LAPACKE. The code is very similar to step0 but, instead of calling CBLAS/LAPACKE functions, we call the CHAMELEON equivalent functions. The solving code becomes:

    /* Factorization: */
    MORSE_dpotrf( UPLO, N, A, N );
    /* Solve: */
    MORSE_dpotrs( UPLO, N, NRHS, A, N, X, N );

The API is almost the same, so that it is easy to use for beginners. It is important to keep in mind that before any call to a MORSE routine, MORSE_Init has to be invoked to initialize MORSE and the runtime system.
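The correctness check mentioned above can be sketched with plain CBLAS/LAPACKE calls. This is not the tutorial's exact code, only an illustration of one way to compute a scaled residual; Acpy and Bcpy are assumed to be copies of the original A and B saved before the factorization, and the scaling formula is one common choice rather than the one used by the examples:

    #include <cblas.h>
    #include <lapacke.h>

    /* Sketch: scaled residual of A*X = B, with Acpy/Bcpy copies of the
     * original data (assumption of this sketch). Bcpy is overwritten. */
    static double check_solution(int N, int NRHS,
                                 const double *Acpy, const double *X, double *Bcpy)
    {
        /* Bcpy := Bcpy - Acpy * X */
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, N, NRHS, N,
                    -1.0, Acpy, N, X, N, 1.0, Bcpy, N);

        double anorm = LAPACKE_dlange(LAPACK_COL_MAJOR, 'I', N, N,    Acpy, N);
        double xnorm = LAPACKE_dlange(LAPACK_COL_MAJOR, 'I', N, NRHS, X,    N);
        double rnorm = LAPACKE_dlange(LAPACK_COL_MAJOR, 'I', N, NRHS, Bcpy, N);

        /* one possible scaling; the exact formula in the tutorial may differ */
        return rnorm / (anorm * xnorm * N);
    }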
50. better keep up with hardware trends, whose complexity is growing exponentially. One major task in this project is to define a proper interface between HPC applications and runtime systems in order to maximize productivity and expressivity. As mentioned in the next section, a widely used approach consists in abstracting the application as a DAG that the runtime system is in charge of scheduling. Scheduling such a DAG over a set of heterogeneous processing units introduces a lot of new challenges, such as predicting accurately the execution time of each type of task over each kind of unit, minimizing data transfers between memory banks, performing data prefetching, etc. Expected advances: in a nutshell, a new runtime system API will be designed to allow applications to provide scheduling hints to the runtime system and to get real-time feedback about the consequences of scheduling decisions.

1.1.2.2 Runtime systems

A runtime environment is an intermediate layer between the system and the application. It provides low-level functionality not provided by the system, such as scheduling or management of the heterogeneity, and high-level features, such as performance portability. In the framework of this proposal we will work on the scalability of runtime environments. To achieve scalability it is required to avoid all centralization. Here the main problem is the scheduling of the tasks. In many task-based runtime environments
51. represented as an array of pointers to tiles located anywhere in memory. Such a layout would be a radical and unjustifiable departure from LAPACK and ScaLAPACK. Flat tile layout is a natural progression from LAPACK's column-major layout and ScaLAPACK's block-cyclic layout.

Another related property of PLASMA's tile layout is that it includes provisions for padding of tiles, i.e., the actual region of memory designated for a tile can be larger than the memory occupied by the actual data. This allows one to force a certain alignment of tile boundaries while using the flat organization described in the previous paragraph. The motivation is that, at the price of a small memory overhead, alignment of tile boundaries may prove beneficial in multiple scenarios involving memory systems of standard multicore processors as well as accelerators. The issues that come into play are, again, the use of TLBs and memory banks or partitions.

[Figure: schematic illustration of the tile layout with column-major order of tiles, column-major order of elements within tiles, and optional padding for enforcing a certain alignment of tile boundaries; courtesy of the PLASMA team.]

1.2.2.3 Dynamic Task Scheduling

Dynamic scheduling is the idea of assigning work to cores based on the availability of data for processing at any given point in time, and is also referred to as data-driven scheduling. The concept is related closely to the idea of expressing computation through a task graph.
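As a purely illustrative sketch (this is not CHAMELEON's or PLASMA's interface), the data-driven idea can be expressed with standard OpenMP task dependencies: each tile operation becomes a task whose in/inout clauses encode the RAW/WAR/WAW hazards between tiles, and the runtime starts a task as soon as its input tiles are ready. The tile Cholesky factorization, whose kernels are described in the following paragraphs, then looks like this (compile with -fopenmp; the first element of each tile is used as its dependence handle):

    #include <cblas.h>
    #include <lapacke.h>

    /* Illustration only, not CHAMELEON/PLASMA code. T[i + j*mt] points to the
     * NB x NB tile (i, j) of a symmetric positive definite matrix stored by
     * tiles; only the lower part is referenced. */
    static void tile_cholesky(double **T, int mt, int NB)
    {
        #pragma omp parallel
        #pragma omp single
        for (int k = 0; k < mt; k++) {
            #pragma omp task depend(inout: T[k + k*mt][0])
            LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', NB, T[k + k*mt], NB);

            for (int i = k + 1; i < mt; i++) {
                #pragma omp task depend(in: T[k + k*mt][0]) depend(inout: T[i + k*mt][0])
                cblas_dtrsm(CblasColMajor, CblasRight, CblasLower, CblasTrans,
                            CblasNonUnit, NB, NB, 1.0, T[k + k*mt], NB, T[i + k*mt], NB);
            }
            for (int i = k + 1; i < mt; i++) {
                #pragma omp task depend(in: T[i + k*mt][0]) depend(inout: T[i + i*mt][0])
                cblas_dsyrk(CblasColMajor, CblasLower, CblasNoTrans, NB, NB,
                            -1.0, T[i + k*mt], NB, 1.0, T[i + i*mt], NB);

                for (int j = k + 1; j < i; j++) {
                    #pragma omp task depend(in: T[i + k*mt][0], T[j + k*mt][0]) depend(inout: T[i + j*mt][0])
                    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasTrans, NB, NB, NB,
                                -1.0, T[i + k*mt], NB, T[j + k*mt], NB,
                                1.0, T[i + j*mt], NB);
                }
            }
        }
    }

The point of the sketch is only to show how a task graph makes the data hazards explicit so that independent tiles can be processed as soon as their inputs are available, which is exactly the behaviour described in this section.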
52. distributions being no more the parameters of a 2-D block-cyclic grid but of another distribution.

4.3.2 List of available routines

4.3.2.1 Auxiliary routines

Report the MORSE version number:

    int MORSE_Version(int *ver_major, int *ver_minor, int *ver_micro);

Initialize MORSE: initialize some parameters, initialize the runtime and/or MPI:

    int MORSE_Init(int nworkers, int ncudas);

Finalize MORSE: free some data and finalize the runtime and/or MPI:

    int MORSE_Finalize(void);

Return the MPI rank of the calling process:

    int MORSE_My_Mpi_Rank(void);

Suspend the MORSE runtime to poll for new tasks, to avoid useless CPU consumption when no tasks have to be executed by the MORSE runtime system:

    int MORSE_Pause(void);

Symmetrical call to MORSE_Pause, used to resume the workers polling for new tasks:

    int MORSE_Resume(void);

Conversion from LAPACK layout to tile layout:

    int MORSE_Lapack_to_Tile(void *Af77, int LDA, MORSE_desc_t *A);

Conversion from tile layout to LAPACK layout:

    int MORSE_Tile_to_Lapack(MORSE_desc_t *A, void *Af77, int LDA);

4.3.2.2 Descriptor routines

Create matrix descriptor (internal function):

    int MORSE_Desc_Create(MORSE_desc_t **desc, void *mat, MORSE_enum dtyp,
                          int mb, int nb, int bsiz, int lm, int ln,
                          int i, int j, int m, int n, int p, int q);

Create matrix descriptor (user function):

    int MORSE_Desc_Create_User(MORSE_desc_t **desc, void *mat, MORSE_enum dtyp,
                               int mb, int nb, int bsiz, int lm, int ln,
                               int i, int j, int m, int n, int p, int q,
                               void* (*get_blkaddr)(const MORSE_desc_t*, int, int),
                               int   (*get_blkldd) (const MORSE_desc_t*, int, int),
                               int   (*get_rankof) (const MORSE_desc_t*, int, int));
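As a quick illustration of the internal descriptor creation routine above, here is a hedged sketch for a double-precision N x N matrix cut into NB x NB tiles; the numeric values and the 1 x 1 process grid are arbitrary choices, NULL lets the library allocate the tile storage, and MorseRealDouble is the double-precision real datatype constant:

    MORSE_desc_t *descA = NULL;
    int N  = 2000;   /* global matrix size (assumed) */
    int NB = 200;    /* tile size (assumed)          */

    /* NB x NB tiles of NB*NB elements, N x N global matrix,
     * submatrix starting at (0, 0) of size N x N, 1 x 1 process grid */
    MORSE_Desc_Create( &descA, NULL, MorseRealDouble,
                       NB, NB, NB*NB, N, N, 0, 0, N, N, 1, 1 );

MORSE_Desc_Destroy, listed below, releases the descriptor once it is no longer needed.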
53. It is very similar for the Cholesky factorization. The left-looking definition of the Cholesky factorization from LAPACK is a loop with a sequence of calls to four routines: xSYRK (symmetric rank-k update), xPOTRF (Cholesky factorization of a small block on the diagonal), xGEMM (matrix multiplication) and xTRSM (triangular solve). If the xSYRK, xGEMM and xTRSM operations are expressed with the canonical definition of three nested loops and the technique of loop tiling is applied, the tile algorithm results. Since the algorithm is produced by simple reordering of operations, neither the number of operations nor the numerical stability of the algorithm is affected.

The situation becomes slightly more complicated for LU and QR factorizations, where the classic algorithms factorize an entire panel of the matrix (a block of columns) at every step of the algorithm. One can observe, however, that the process of matrix factorization is synonymous with introducing zeros in appropriate places, and a tile algorithm can be thought of as one that zeroes one tile of the matrix at a time. This process is referred to as updating of a factorization or incremental factorization. The process is equivalent to factorizing the top tile of a panel, then placing the upper triangle of the result on top of the tile below and factorizing again, then moving to the next tile, and so on. Here the tile LU and QR algorithms perform slightly more floating-point operations and require slightly
54. through kernels provided by cuBLAS and MAGMA, and clusters of interconnected nodes with distributed memory using MPI. Computation of very large systems with dense matrices on a cluster of nodes is still being experimented and stabilized; it is not expected to get stable performances with the current version using MPI.

1.2.2 PLASMA's design principles

CHAMELEON is originally based on PLASMA, so that the design principles are very similar. The content of this section (Section 1.2.2 [PLASMA's design principles], page 3) has been copied from the "Design principles" section of the PLASMA User's Guide.

1.2.2.1 Tile Algorithms

Tile algorithms are based on the idea of processing the matrix by square tiles of relatively small size, such that a tile fits entirely in one of the cache levels associated with one core. This way a tile can be loaded to the cache and processed completely before being evicted back to the main memory. Of the three types of cache misses (compulsory, capacity and conflict), the use of tile algorithms minimizes the number of capacity misses, since each operation loads an amount of data that does not overflow the cache.

For some operations, such as matrix multiplication and Cholesky factorization, translating the classic algorithm to the tile algorithm is trivial. In the case of matrix multiplication, the tile algorithm is simply a product of applying the technique of loop tiling to the canonical definition of three nested loops.
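For instance, here is a plain C sketch of loop tiling applied to matrix multiplication. It is illustrative only: square matrices of size N with a tile size NB that divides N are assumed, and everything is stored column-major as in LAPACK.

    /* C := C + A * B, processed by NB x NB tiles (loop tiling applied to the
     * canonical three nested loops). Column-major storage, leading dimension N. */
    static void tiled_gemm(int N, int NB, const double *A, const double *B, double *C)
    {
        for (int jt = 0; jt < N; jt += NB)          /* tile column of C */
            for (int kt = 0; kt < N; kt += NB)      /* shared dimension  */
                for (int it = 0; it < N; it += NB)  /* tile row of C     */
                    /* multiply one pair of tiles */
                    for (int j = jt; j < jt + NB; j++)
                        for (int k = kt; k < kt + NB; k++)
                            for (int i = it; i < it + NB; i++)
                                C[i + j*N] += A[i + k*N] * B[k + j*N];
    }

Each iteration of the three outer loops touches only three NB x NB tiles, which is what allows the working set to fit in cache.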
55. subject to change; it should be enriched in the near future.

4 Using CHAMELEON

4.1 Using CHAMELEON executables

CHAMELEON provides several test executables that are compiled and linked with CHAMELEON's stack of dependencies. Instructions about the arguments to give to the executables are accessible thanks to the option --help or -h. This set of binaries is separated into three categories and can be found in three different directories:

- example: contains examples of API usage and, more specifically, the sub-directory lapack_to_morse provides a tutorial that explains how to use CHAMELEON functionalities starting from a full LAPACK code, see Section 4.3.1 [Tutorial LAPACK to CHAMELEON], page 23.
- testing: contains testing drivers to check the numerical correctness of CHAMELEON linear algebra routines with a wide range of parameters:

      ./testing/stesting 4 1 LANGE 600 100 700

  The two first arguments are the number of cores and GPUs to use. The third one is the name of the algorithm to test. The other arguments depend on the algorithm; here they stand for the number of rows, the number of columns and the leading dimension of the problem. Names of the algorithms available for testing are:
  - LANGE: norms of matrices (Infinite, One, Max, Frobenius)
  - GEMM: general matrix-matrix multiply
  - HEMM: hermitian matrix-matrix multiply
  - HERK: hermitian matrix-matrix rank-k update
  - HER2K: hermitian matrix-matrix rank-2k update
  - SYMM: symmetric matrix-matrix multiply
56. _t bump, MORSE_desc_t *A, unsigned long long int seed,
                                MORSE_sequence_t *sequence, MORSE_request_t *request);

    int MORSE_zplrnt_Tile_Async(MORSE_desc_t *A, unsigned long long int seed,
                                MORSE_sequence_t *sequence, MORSE_request_t *request);

    int MORSE_zposv_Tile_Async(MORSE_enum uplo, MORSE_desc_t *A, MORSE_desc_t *B,
                               MORSE_sequence_t *sequence, MORSE_request_t *request);

    int MORSE_zpotrf_Tile_Async(MORSE_enum uplo, MORSE_desc_t *A,
                                MORSE_sequence_t *sequence, MORSE_request_t *request);

    int MORSE_zsytrf_Tile_Async(MORSE_enum uplo, MORSE_desc_t *A,
                                MORSE_sequence_t *sequence, MORSE_request_t *request);

    int MORSE_zpotri_Tile_Async(MORSE_enum uplo, MORSE_desc_t *A,
                                MORSE_sequence_t *sequence, MORSE_request_t *request);

    int MORSE_zpotrs_Tile_Async(MORSE_enum uplo, MORSE_desc_t *A, MORSE_desc_t *B,
                                MORSE_sequence_t *sequence, MORSE_request_t *request);

    #if defined(PRECISION_c) || defined(PRECISION_z)
    int MORSE_zsytrs_Tile_Async(MORSE_enum uplo, MORSE_desc_t *A, MORSE_desc_t *B,
                                MORSE_sequence_t *sequence, MORSE_request_t *request);
    #endif

    int MORSE_zsymm_Tile_Async(MORSE_enum side, MORSE_enum uplo,
                               MORSE_Complex64_t alpha, MORSE_desc_t *A, MORSE_desc_t *B,
                               MORSE_Complex64_t beta, MORSE_desc_t *C,
                               MORSE_sequence_t *sequence, MORSE_request_t *request);

    int MORSE_zsyrk_Tile_Async(MORSE_enum uplo, MORSE_enum trans,
                               MORSE_Complex64_t alpha, MORSE_desc_t *A,
57. Destroy a matrix descriptor:

    int MORSE_Desc_Destroy(MORSE_desc_t **desc);

Ensure that all data are up-to-date in main memory, even if some tasks have been processed on GPUs:

    int MORSE_Desc_Getoncpu(MORSE_desc_t *desc);

4.3.2.3 Options routines

Enable a MORSE feature:

    int MORSE_Enable(MORSE_enum option);

Features that can be enabled:
- MORSE_WARNINGS: printing of warning messages,
- MORSE_ERRORS: printing of error messages,
- MORSE_AUTOTUNING: autotuning for tile size and inner block size,
- MORSE_PROFILING_MODE: activate kernels profiling.

Disable a MORSE feature:

    int MORSE_Disable(MORSE_enum option);

Symmetric to MORSE_Enable.

Set a MORSE parameter:

    int MORSE_Set(MORSE_enum param, int value);

Parameters that can be set:
- MORSE_TILE_SIZE: size of the matrix tile,
- MORSE_INNER_BLOCK_SIZE: size of the tile inner block,
- MORSE_HOUSEHOLDER_MODE: type of Householder trees (FLAT or TREE),
- MORSE_HOUSEHOLDER_SIZE: size of the groups in Householder trees,
- MORSE_TRANSLATION_MODE: related to MORSE_Lapack_to_Tile, see ztile.c.

Get the value of a MORSE parameter:

    int MORSE_Get(MORSE_enum param, int *value);

4.3.2.4 Sequences routines

Create a sequence:

    int MORSE_Sequence_Create(MORSE_sequence_t **sequence);

Destroy a sequence:

    int MORSE_Sequence_Destroy(MORSE_sequence_t *sequence);
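To give an idea of how sequences and the Tile_Async routines fit together, here is a hedged sketch. descA and descB are assumed to be existing tile descriptors of a symmetric positive definite matrix and of the right-hand sides; the MORSE_REQUEST_INITIALIZER macro, the MorseUpper constant and the MORSE_Sequence_Wait synchronization routine are assumed names, consistent with the synchronization step described in the tutorial:

    MORSE_sequence_t *sequence = NULL;
    MORSE_request_t   request  = MORSE_REQUEST_INITIALIZER;  /* assumed initializer */

    MORSE_Sequence_Create( &sequence );

    /* submit the factorization and the solve without an intermediate barrier */
    MORSE_dpotrf_Tile_Async( MorseUpper, descA, sequence, &request );
    MORSE_dpotrs_Tile_Async( MorseUpper, descA, descB, sequence, &request );

    /* wait for the completion of all tasks of the sequence (assumed routine) */
    MORSE_Sequence_Wait( sequence );

    MORSE_Sequence_Destroy( sequence );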
58. algorithms, e.g. sgemm for general matrix-matrix multiply in simple precision. CHAMELEON provides three interface levels:

- MORSE_name: the simplest interface, very close to CBLAS and LAPACKE; matrices are given following the LAPACK data layout (1-D array, column-major). It involves copies of data from the LAPACK layout to the tile layout and conversely (to update the LAPACK data), see Section 4.3.1.2 [Step1], page 24.
- MORSE_name_Tile: the tile interface avoids copies between LAPACK and tile layouts. It is the standard interface of CHAMELEON and it should achieve better performance than the previous, simplest interface. The data are given through a specific structure called a descriptor, see Section 4.3.1.3 [Step2], page 25.
- MORSE_name_Tile_Async: similar to the tile interface, it avoids the synchronization barrier normally called between Tile routines. At the end of an Async function, completion of the tasks is not guaranteed and data are not necessarily up-to-date. To ensure that all tasks have been executed, a synchronization function has to be called after the sequence of Async functions, see Section 4.3.1.5 [Step4], page 27.

MORSE routine calls have to be preceded by MORSE_Init(NCPU, NGPU) to initialize MORSE and the runtime system, and followed by MORSE_Finalize() to free some data and finalize the runtime and/or MPI.

4.3.1 Tutorial LAPACK to CHAMELEON

This tutorial is dedicated to the API usage of CHAMELEON. The idea is
59. to start from a simple code and, step by step, explain how to use CHAMELEON routines. The first step is a full BLAS/LAPACK code without dependencies to CHAMELEON, a code that most users should easily understand. Then the different interfaces CHAMELEON provides are exposed, from the simplest API (step1) to more complicated ones (until step4). The way some important parameters are set is discussed in step5. Finally, step6 is an example about distributed computation with MPI.

Source files can be found in the example/lapack_to_morse/ directory. If the CMake option CHAMELEON_ENABLE_EXAMPLE is ON, then the source files are compiled with the project libraries. The arithmetic precision is double. To execute a step X, enter the following command:

    ./step_X --option1 --option2 ...

Instructions about the arguments to give to the executables are accessible thanks to the option --help or -h. Note that there exist default values for options.

For all steps, the program solves a linear system Ax = B. The matrix values are randomly generated but ensure that matrix A is symmetric positive definite, so that A can be factorized in an LL^T form using the Cholesky factorization.

Let's comment the different steps of the tutorial.

4.3.1.1 Step0

The C interfaces of BLAS and LAPACK, that is CBLAS and LAPACKE, are used to solve the system. The size of the system (matrix) and the number of right-hand sides can be given as arguments to the executable (be careful not to give huge
60. ware to modulate process speed in order to satisfy limited energy budgets. The MORSE associate team will tackle the first three challenges by orchestrating work between research groups respectively specialized in sparse linear algebra, dense linear algebra and runtime systems. The overall objective is to develop robust linear algebra libraries relying on innovative runtime systems that can fully benefit from the potential of those future large-scale complex machines. Challenges 4 and 5 will also be investigated by the different teams in the context of other partnerships, but they will not be the main focus of the associate team as they are much more prospective.

1.1.2 Research fields

The overall goal of the MORSE associate team is to enable advanced numerical algorithms to be executed on a scalable unified runtime system, in order to exploit the full potential of future exascale machines. We expect advances in three directions, based first on strong and close interactions between the runtime and numerical linear algebra communities. This initial activity will then naturally expand to more focused but still joint research in both fields.

1.1.2.1 Fine interaction between linear algebra and runtime systems

On parallel machines, HPC applications need to take care of data movement and consistency, which can be either explicitly managed at the level of the application itself or delegated to a runtime system. We adopt the latter approach in order to better
