
Report on knowledge transfer and training


Contents

1. [Figure: Fig. 2 - Fibonacci(35): number of threads in four single-node configurations; x-axis: clock cycles (millions).]
Deliverable number: D7.4. Deliverable name: Report on knowledge transfer and training. File name: TERAFLUX-D74-v1.0.doc. Page 17 of 50. Project: TERAFLUX (Exploiting dataflow parallelism in Teradevice Computing). Grant Agreement Number: 249013. Call: FET proactive 1: Concurrent Tera-device Computing (ICT-2009.8.1).
In the following, the results of the execution of the Fibonacci benchmark are discussed. In particular, Fig. 2 shows the number of threads waiting, ready, and running in the system during the execution of the recursive computation of the 35th term of the Fibonacci series, targeting four different single-node configurations (4, 8, 16, and 32 cores). The figure highlights two aspects. First, the maximum number of threads created in the system is about 1.5M over all the different configurations. Second, the execution time is halved when the number of cores in the node doubles. These results show that the COTSon simulator is now able to support the T* execution model (a dataflow execution model), achieving almost perfect scaling. However, the timing model still has to be tuned by connecting the existing memory-hierarchy timing models of COTSon to the T* components; this activity is ongoing and is briefly described in section 2.3.1. In Fig. 3 we also show a zoom of the bottom...
2. r32 = DF_TREAD(4);   /* index for A */
r30 = DF_TREAD(5);   /* index for B */
r5  = DF_TREAD(6);   /* size */
r26 = DF_TREAD(7);   /* the sum for this result-matrix element */
r34 = DF_TREAD(8);
r10 = DF_TREAD(9);   /* start index */
r58 = DF_TREAD(10);  /* pointer to the FM of thread join_threads */
r7  = DF_TREAD(11);  /* the end index for this part */
r55 = DF_TREAD(12);  /* log size */
uint64_t *pC = (uint64_t *) r4;  /* r4 holds the address (pointer) to the matrix C */
pC[r10] = r26;
/* if (r10 == r7) this is the last element: schedule mul_thread_end;
   else: schedule mul_thread_next_el */
uint8_t cnd = (r10 == r7);
uint64_t r44 = DF_TSCHEDULE(cnd, mul_thread_end, 8);
r44 = DF_TSCHEDULE(!cnd, mul_thread_next_el, 8);
DF_TWRITE(r2, r44, 1);   /* A */
DF_TWRITE(r3, r44, 2);   /* B */
DF_TWRITE(r4, r44, 3);   /* C */
DF_TWRITE(r10, r44, 4);  /* start index */
DF_TWRITE(r55, r44, 5);  /* log size */
DF_TWRITE(r5, r44, 6);   /* size */
DF_TWRITE(r58, r44, 7);  /* pointer to the FM of thread join_threads */
DF_TWRITE(r7, r44, 8);   /* the end index for this part */
DF_TDESTROY();
}

void mul_thread_end(void)
{
    DF_TLOAD(1);
    uint64_t r58 = DF_TREAD(7);  /* pointer to the FM of thread join_threads */
    DF_TWRITE(1, r58, 1);        /* fake write needed to signal the thread join_threads */
    DF_TDESTROY();
}

void join_threads(void)
{
    DF_TLOAD(2);
    uint64_t r4 = DF_TREAD(4);   /* pointer to the result matrix C */
    r5 = DF_...
3. [Screenshot excerpt: "#pragma ddm function ... #pragma ddm endfunction ... #pragma ddm endprogram".]
Fig. 22: The side-panel plug-in automatically closing the DDM pragmas.

The side-panel plug-in autocompletes the ending (closing) macros of a DDM pragma after pressing Enter at the end of a pragma directive line (Fig. 22).

[Screenshot: Eclipse workspace (Resource perspective, testfile/main.c, including <stdio.h>) showing DDM pragmas such as "#pragma ddm kernel 16", "#pragma ddm thread 1 kernel DVM_RROBIN 0 readycount 5", "#pragma ddm endthread", and "#pragma ddm thread 2 kernel DVM_STATIC 3 readycount 4", with a Sample View side panel listing per-thread properties (thread number, kernel, sched_mode, sched_value, readycount, arity, import_var, export_var, import_export_var) and an empty Tasks view.]
4. ...tested routine execution of a running system for several billion instructions, as well as all the binaries shipped with our standard Linux distribution, without any occurrence of that instruction. With the above choices, the overloaded instruction encoding looks as follows:

    0F 18 84 rr XX II AF 2D
     0  1  2  3  4  5  6  7

where 0x0F18 is the x86 opcode for prefetchnta, 0x84 is the value of the MOD/RM field of the prefetchnta instruction, rr (1 byte) is the byte that corresponded to the SIB byte, and II (1 byte) and XX (1 byte) are the two remaining bytes from the displacement. This allows us to use:
- the rr value for encoding the two x86 registers used in the T* instruction. We currently chose to limit the registers to the core set available in both the 32b and 64b x86 ISA variants, for simplicity, but we may extend the choice to more 64b registers in the future if the need for additional registers arises;
- the XX value for encoding the T* opcode (up to 256 opcodes);
- the II value for encoding an 8-bit T* immediate, if needed, or two other registers, as in the rr field.
Let us consider as an example what happens with a TREAD operation...
5. In Table 1 we provide the full list of all the T* ISE opcodes (i.e., all the possible values for the XX field) introduced so far in the COTSon simulator.

Table 1: OPCODEs for T* instructions (the instructions with a grey background in the original table are reported for completeness, but have not yet been fully implemented in the simulator).

OPCODE  INSTRUCTION    OPCODE  INSTRUCTION    OPCODE  INSTRUCTION
0x01    TINIT          0x0D    TSCHEDULEI     0x19    TDECREASEN
0x02    TSCHEDULE      0x0E    TREADQI        0x1A    TDECREASENI
0x03    TREAD          0x0F    TWRITEQI       0x1B    TWRITEP
0x04    TWRITE         0x10    TSCHEDULEP     0x1C    TWRITEPI
0x05    TALLOC         0x11    TSCHEDULEPI    0x1D    TWRITEQPI
0x06    (unassigned)   0x12    TLOAD          0x1E    TSCHEDULEZ
0x07    TPOLL          0x13    TSTORE         0x1F    TWRITE32P
0x08    TRESET         0x14    TSTOREQI       0x20    TWRITE32PI
0x09    TSTAMP         0x15    TSCHEDULEF     0x21    TDESTROY
0x0A    (unassigned)   0x16    TSCHEDULEFI    0x22    TSTOREPI
0x0B    TREADI         0x17    TCACHE
0x0C    TWRITEI        0x18    TDECREASE

2.2.1 Brief Introduction to COTSon's Implementation of T*

The set of supported T* ISE currently experimented with is the following:
- tschedulepi tid, ip, %cnd, $sc: Conditionally schedules a thread whose address is in register ip. Register cnd holds the predicate. The immediate sc holds the synchronization count (0-255). It returns a thread handle in register tid, or 0 if the predicate is false. The...
6. [Screenshot: proposal window listing the available DDM pragma completions, e.g. "block number cond_update threadID start_update_value", "endblock", "endfor", "endfunction", "endprogram", "endthread", "for thread number kernel sched_mode sched_...", "for thread number reduction private_var global..."; caption bar: "Press Ctrl+Space to show DDM Pragmas Proposal".]
Fig. 18: The content-assistant plug-in listing the available DDM keywords.

3.3.1 The Content Assistant Plug-in

Fig. 18 illustrates the basic functionality of the content-assistant plug-in. While a user is writing a pragma directive, by typing "#pragma ddm", leaving a blank space, and pressing the CTRL+SPACE key combination, a proposal window appears with all the available options for that specific pragma.

[Screenshot: proposal window listing DVM_CUSTOM, DVM_DYNAMIC, DVM_MODULAR, DVM_RROBIN, DVM_STATIC for the partially typed directive "#pragma ddm thread 5 kernel DVM_".]
Fig. 19: The content-assistant plug-in filtering the DDM keywords starting with DVM_ for the scheduling-policy field of the thread pragma.

In Fig. 19 the user is already editing a DDM pragma, so...
7. DF_TWRITE(r2, r59, 1);   /* A */
DF_TWRITE(r3, r59, 2);   /* B */
DF_TWRITE(r12, r59, 3);  /* C */
DF_TWRITE(r4, r59, 4);   /* size */
DF_TWRITE(r5, r59, 5);   /* np */
DF_TWRITE(r14, r59, 6);
DF_TWRITE(r10, r59, 7);  /* size*size/num_processors: represents this part's size */
DF_TWRITE(r55, r59, 8);  /* log size */
DF_TWRITE(r58, r59, 9);  /* pointer to the FM of thread join_threads */
DF_TWRITE(r16, r59, 10); /* next proc */
DF_TDESTROY();
}

r2 = DF_TREAD(1);   /* A */
r3 = DF_TREAD(2);   /* B */
r4 = DF_TREAD(3);   /* C */
r5 = DF_TREAD(4);   /* size */
r6 = DF_TREAD(5);
r7 = DF_TREAD(6);   /* size*size/np */
r55 = DF_TREAD(7);  /* log size */
r58 = DF_TREAD(8);  /* pointer to the FM of thread join_threads */
/* r7 holds the end index; r10 takes the start index (from r6) */
uint64_t r44 = DF_TSCHEDULE(1, mul_thread_next_el, 8);
DF_TWRITE(r2, r44, 1);   /* A */
DF_TWRITE(r3, r44, 2);   /* B */
DF_TWRITE(r4, r44, 3);   /* C */
DF_TWRITE(r10, r44, 4);  /* start index */
DF_TWRITE(r55, r44, 5);  /* log size */
DF_TWRITE(r5, r44, 6);   /* size */
DF_TWRITE(r58, r44, 7);  /* pointer to the FM of thread join_threads */
DF_TWRITE(r7, r44, 8);   /* the end index for this part */
DF_TDESTROY();
}

uint64_t r2 = DF_TREAD(1);  /* A */
r3 = DF_TREAD(2);           /* B */
r4 = DF_TREAD(3);
r10 = DF_TREA...
8. 2.5.1 Framework design

A scheme of the framework for multi-node simulation is shown in Fig. 5. Access to the DF-Frame information among nodes is provided through shared memory allocated on the host machine. These shared data structures hold: (1) a Circular Queue, holding the continuations of created DF-Threads that are not ready for execution; and (2) the Ready Queue, for those threads whose synchronization count has reached zero. A Scheduler is responsible for properly managing these queues. In the current implementation, the Scheduler distributes the ready DF-Threads among nodes following a simple round-robin policy. Nodes can access the DF-Frame Memory through a message queue to a high-level entity we called the Manager; this Manager is responsible for dynamically allocating and deallocating DF-Frame Memory. A timing model for the multi-node framework, designed as an extension of the single-node timing model, is currently under development and will be completed in the next period.

2.5.2 Demonstration of the multi-node capability of the new distributed scheduler

Fig. 6 shows the speedup with respect to the single-core case...
9. ...London, UK: ACM, 2012.
[Kavi01] Kavi, K. M., Giorgi, R., Arul, J., "Scheduled Dataflow: Execution Paradigm, Architecture, and Performance Evaluation", IEEE Trans. Computers, Los Alamitos, CA, USA, vol. 50, no. 8, Aug. 2001, pp. 834-846.
[Koren07] Koren, I., Krishna, M. C., "Fault-Tolerant Systems", San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2007.
[McPAT09] Sheng Li, Jung Ho Ahn, Strong, R. D., Brockman, J. B., Tullsen, D. M., Jouppi, N. P., "McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures", in Proc. of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-42), 12-16 Dec. 2009, pp. 469-480.
[Portero11] Portero, A., Zhibin Yu, Giorgi, R., "T-star (T*): An x86-64 ISA extension to support thread execution on many cores", ACACES - Advanced Computer Architecture and Compilation for High-Performance and Embedded Systems, pp. 277-280, 2011.
[Portero12] Portero, A., Scionti, A., Zhibin Yu, Faraboschi, P., Concatto, C., Carro, L., Garbade, A., Weis, S., Ungerer, T., Giorgi, R., "Simulating the Future kilo-x86-64 core Processors and their Infrastructure", 45th Annual Simulation Symposium (ANSS), March 2012, Orlando, Florida.
[Ronen97] Ronen, R., "Method of modifying an instruction set architecture of a computer processor to maintain backward compatibility", patent US5701442, Dec. 1997.
[SF] http://cotson.svn.sourceforge.net/viewvc/cotson
[SimNow09] AMD SimNow Simulator 4...
10. ...particular to the internal structures that are activated during the execution of each instruction. Combining these statistics with a description of the specific modeled microarchitecture, the tool can estimate the static and dynamic power-consumption components (e.g., power consumption for the cache memories, power consumption for the cores, etc.), timing, and area utilization. In order to enable the simulated system to schedule DF-threads according to policies that account for the current power consumption, as well as the temperature and the reliability level, the system must be equipped with a power, fault, and temperature measurement system. From the perspective of the simulator, this goal can be achieved by integrating the McPAT tool within the COTSon simulator.

2.6.1 Off-line vs. on-line power estimation

As a first step towards a complete integration, McPAT has been enabled to run at the end of each heartbeat, computing power estimations on a periodic basis. Periodic power estimation is obtained by storing the execution statistics coming from the COTSon simulator at every heartbeat. The heartbeat is the internal interval used by the simulator to store the statistics; the interval size is not fixed. In the off-line approach, McPAT is run at the end of the COTSon simulation: it processes all the recorded heartbeats in sequence, after the execution of the program. On the contrary, in...
11. - twriteqi tid, tval, $im: Writes the 64b value in register tval to the location at (immediate) offset im of the frame of thread tid. This is the immediate form, with im < 256. The offset im is expressed in 64b words, i.e., offset 2 is byte 16. For im > 255, or a variable offset, use the general form TWRITE.
- talloc and tfree are encoded, but their semantics are still to be defined.

The above instructions correspond to the instructions TSCHEDULE, TDESTROY, TREAD, TWRITE, TALLOC, TFREE as introduced in the previous deliverable D6.2 (see Table 1). Additionally, we are currently experimenting with other instructions:
- tschedule tid, ip, sc: Schedules the thread unconditionally; the start address is located in register ip, and register sc contains the synchronization count. tschedule returns a thread handle in register tid. By design we decided to use thread handles expressed on 32 bits; moreover, for efficiency reasons, we store these handles in the 32 most significant bits of tid. In this way we can do standard address arithmetic on thread handles (e.g., add an offset to obtain the address of an individual element of the thread frame), almost as if they were addresses. This is the general form, used with a variable sc or sc > 255. For the immediate version (sc < 256), tschedulei is more efficient.
- tschedulei tid, ip, $sc: Schedules the thread...
12. Release Approval
Name            Role         Date
Marco Solinas   Originator   08/11/2012
Roberto Giorgi  WP Leader    28/11/2012
Roberto Giorgi  Coordinator  13/12/2012

TABLE OF CONTENTS
GLOSSARY .......... 4
EXECUTIVE SUMMARY .......... 7
1 INTRODUCTION .......... 8
1.1 Relation to Other Deliverables .......... 8
1.2 Activities Referred by this Deliverable .......... 9
1.3 Summary of Previous Work from D7.1, D7.2 and D7.3 .......... 9
2 NEW SIMULATION FEATURES .......... 10
2.1 Brief Overview of the TERAFLUX Evaluation Platform (all WP7 partners) .......... 10
2.2 T* Instructions and Built-in Support in the C Language (UNISI, HP) .......... 11
2.2.1 Brief Introduction to COTSon's Implementation .......... 13
2.3 New T* Benchmarks (UNISI) .......... 16
2.3...
13. - Inside the TSU, the DF-Frame memory and the DF-Frame cache are modeled. For example, we can assume that the access latency of the DF-Frame cache is equal to that of the Core-Level Cache Hierarchy (CL-H) in the Architectural Template presented in D6.2 (Figure 1), and that the access latency of the physical DF-Frame memory is equal to a normal memory access.
- The latency feedback for these accesses in the TSU is passed to the timer in COTSon/SimNow.

[Fig. 4 diagram: frame accesses are routed to the Frame Memory and Frame Cache models inside COTSon.]
Fig. 4: Timing model for the T* execution.

In order to provide the COTSon user with an easy way to model the architecture (for example, to explore different configurations characterized by different timings), we define the sizes of the DF-Frame cache, the DF-Frame memory, and the queues in a configuration file (e.g., the tsu.lua file), which is processed by COTSon. In the current simulator integration we have implemented the filtering of T* instructions and memory accesses into the TSU. The next steps will be modeling the DF-Frame memory and the DF-Frame cache.

2.5 Multi-Node T* Tests (UNISI)

The simulation environment described in section 2.3 created the basis for single-node simulations; we decided not to exceed the size of 32 cores per node (current commercial processors, like the AMD 6200, encompass 16 cores per processor). In order to simulate systems with a higher number of cores, the number of nodes of the target m...
14. r8 = r4 * r4;  /* number of matrix elements */
uint64_t *r2 = (uint64_t *) malloc(r8 * sizeof(uint64_t));   /* matrix A */
uint64_t *r3 = (uint64_t *) malloc(r8 * sizeof(uint64_t));   /* matrix B */
for (i = 0; i < r8; i++) { r2[i] = rand(); r3[i] = rand(); }
tt = df_tstamp(ts0);  /* START TIMING */
uint64_t *r12 = (uint64_t *) malloc(r8 * sizeof(uint64_t));  /* matrix AxB */
for (i = 0; i < r8; i++) r12[i] = 0;
uint64_t r14 = 0;
r10 = r8 / r5;       /* size*size/num_processors: represents this part's size */
r55 = log2(r4);      /* log size */
uint64_t r13 = r5 + 2;
uint64_t r58 = DF_TSCHEDULE(1, join_threads, r13);
DF_TWRITE(r12, r58, 4);  /* write in the FM of join_threads the pointer to matrix C */
DF_TWRITE(r4, r58, 5);   /* write the size of the result matrix C */
uint64_t r59 = DF_TSCHEDULE(1, main_ep_1, 10);
DF_TWRITE(r2, r59, 1);   /* A */
DF_TWRITE(r3, r59, 2);   /* B */
DF_TWRITE(r12, r59, 3);  /* C */
DF_TWRITE(r4, r59, 4);   /* size */
DF_TWRITE(r5, r59, 5);   /* np */
DF_TWRITE(r14, r59, 6);
DF_TWRITE(r10, r59, 7);  /* size*size/num_processors: represents this part's size */
DF_TWRITE(r55, r59, 8);  /* log size */
DF_TWRITE(r58, r59, 9);  /* pointer to the FM of thread join_threads */
DF_TWRITE(r16, r59, 10);
return 0;
}

void main_ep_1(void)
{
    /* frame is the frame pointer of the thread */
    DF_TLOAD(10);
    uint64_t r2 = DF_TREAD(1);  /* A */
    r3 = DF_TREAD(2);           /* B */
    r12 = DF_TREAD(3);          /* C */ ...
15. ...source directory in which the SimNow binary package is stored. If this directory is empty, PIKE tries to download the SimNow binary package files directly from the AMD website.
simnow: SimNow installation directory, if it exists.
log: log directory, where statistics, errors, and output are stored at the end of the simulation.
Table 2: Paths to the directories needed by PIKE.

If the path of a specific directory is not specified in the configuration file, it is searched for in the WorkingDirectory. It is possible to create a skeleton of the WorkingDirectory using the script create_skel.sh inside the tools directory. The PIKE directory has the structure shown in Table 3:
lib/pike: contains the libraries and classes for the PIKE operations.
bin: contains the main PIKE scripts.
Table 3: Structure of the PIKE directory.

3.2.2 Functions Exposed to the User

PIKE currently allows the user to automate the execution of batch simulations. It allows specifying custom parameters in order to explore different hardware configurations for the target system, together with control parameters eventually needed by the benchmarks. These parameters...
16. [Table residue: remaining rows of the per-heartbeat power output, e.g. totals of 10.6919204 W, 10.6919764 W, 10.6919534 W, and 10.6918994 W.]
[Fig. 7 diagram: COTSon simulation produces statistics and a machine configuration; per-heartbeat XML configurations are generated for McPAT; McPAT output is plotted as power consumption vs. heartbeats (e.g., CPU 0 total and leakage power over periods of 172.667, 247.167, 329.501, and 247.334 ms).]
Fig. 7: Power estimation sample outputs.

The off-line power-estimation process starts with a complete simulation running on the COTSon simulation infrastructure. During the simulation, all the relevant statistics are collected, through the internal timer components of the simulator, within a local SQL database. The database also contains the main configuration parameters of the simulated machine. Simulation statistics are organized on a per-heartbeat basis. At the end of each heartbeat, the content of the database is parsed in order to provide, for each heartbeat, an XML-based configuration file for the McPAT tool. The XML configuration file contains both the main statistics for the current heartbeat and the m...
17. ...Data-Driven Multithreading (DDM) model are faced with two difficulties. First, given the nature of the model, which is based on the dataflow execution of threads, users need to analyze the problem, split it into threads, and find the data-dependency relations among those threads. This is usually the hard step of the programming. The second difficulty is that, in order to express these threads and dependencies, users need to use a new set of directives in their programs. In order to address this second issue, we have developed a plug-in for Eclipse that helps programmers with the task of adding the DDM directives to their code, and that also integrates, in an easier way, the different tools needed to generate the DDM executable. The DDM Eclipse plug-in is composed of three modules: the Content Assistant, which shows a drop-down list of available pragma directives while the user is coding; the Side Panel, which displays a panel next to the code showing the available directives and their arguments; and the Pre-processor integration, which offers the ability to call the DDM pre-processor and generate the code from within Eclipse. The following figures show different screenshots from the procedure of developing a DDM code using the new Eclipse plug-in. [Screenshot excerpt: a test file exercising all the pragmas, e.g. "#pragma ddm block number cond_update threadID start_update_value".]
18. /* FRAME */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <math.h>
#include "tsu.h"

#define DF_TSCHEDULE(cond, ip, sc)  df_tschedule((cond), (ip), (sc))
#define DF_TWRITE(val, tid, off)    df_write((tid), (off), (val))
#define DF_TWRITEN(val, tid, off)   df_writeN((tid), (off), (val))
#define DF_TREAD(off)               df_frame(off)
#define DF_TLOAD(n)                 df_ldframe(n)
#define DF_TDESTROY()               df_destroy()

/* reporting / help */
uint64_t tt;
uint64_t ts0[100], ts1[100];

/* df-threads pre-declaration */
void main_ep_1(void);
void main_end(void);
void mul_thread(void);
void move_to_next_el(void);
void mul_thread_next_el(void);
void calc_curr_el(void);
void mul_thread_end(void);
void join_threads(void);

void usage(void)
{
    printf("\nMatrix multiplier\nusage: mmul s np\nwhere s = size of the squared matrix, np = number of available cores\n");
    fflush(stdout);
}

int main(int argc, char *argv[])
{
    uint64_t r4, r5, i;
    if (argc != 3) { usage(); return 1; }
    r4 = atoi(argv[1]);  /* size */
    r5 = atoi(argv[2]);  /* number of processors */
    srand(time(NULL));
    uint64_t r8...
19. TM: Transactional Memory
TMS: Transactional Memory Support
TP: Threaded Procedure
Virtualizer: Synonymous with Emulator
VCPU: Virtual CPU, or Virtual Core

The following list of authors will be updated to reflect the list of contributors to the writing of the document:
Marco Solinas, Alberto Scionti, Andrea Mondelli, Ho Nam, Antonio Portero, Stamatis Kavvadias, Monica Bianchini, Roberto Giorgi (Università di Siena)
Arne Garbade, Sebastian Weis, Theo Ungerer (Universität Augsburg)
Antoniu Pop, Feng Li, Albert Cohen (INRIA)
Lefteris Eleftheriades, Natalie Masrujeh, George Michael, Lambros Petrou, Andreas Diavastos, Pedro Trancoso, Skevos Evripidou (University of Cyprus)
Nacho Navarro, Rosa Badia, Mateo Valero (Barcelona Supercomputing Center)
Paolo Faraboschi (Hewlett-Packard Española)
Behram Khan, Salman Khan, Mikel Luján, Ian Watson (The University of Manchester)
(c) 2009-13 TERAFLUX Consortium, All Rights Reserved. A document marked as PU (Public) is published in Italy for the TERAFLUX Consortium on the...
20. ...an interface for writing C applications has also been realized; we report in section 2.4 a brief description of some kernel benchmarks that we realized, while the compiler support for generating T* applications is reported in section 2.9. The extension of the TSU to the multi-node case is now available to partners, as described in section 2.5; in section 2.4 we describe the first steps of the implementation of a timing model for T* instructions in the single-node case, which is still an ongoing activity. The available mechanism for estimating power consumption is reported in section 2.6. In sections 2.7 and 2.8 the activities performed for integrating the DDM-style hardware scheduler in COTSon are reported. The implementation of the FDU mechanisms for double execution and thread restart/recovery is described in section 2.10, while section 2.11 provides a description of the fault-injection model. The enhanced support for Transactional Memory in COTSon for the multi-node case is discussed in section 2.12. Finally, in section 3 we describe the simulation environment and the support that was made available to the Partners, on both the hardware side and the software side. Moreover, in section 3.5 we report on some training events on OmpSs organized by BSC and open to TERAFLUX partners.

1.1 Relation to Other Deliverables

The activities under WP7 are related to the integration of the research performed in the other TERAFLUX workpackages. In particu...
21. [Screenshot: two SimNow console windows during a multiple-simulation PIKE run; the Ubuntu guests run the data-collection and network-restart scripts (e.g., xget_data, cluster.sh, cotson_network_restart, restarting the OpenBSD Secure Shell server sshd) and log test output.]
Fig. 17: Two SimNow windows in the case of a multiple-simulation PIKE run.

3.3 The Eclipse Module for TFLUX (UCY)

In the context of WP3 we explored the augmentation of the dataflow model with support for transactions. In this workpackage (WP7) we report our progress on providing tools for further transferring the knowledge of TFLUX; in particular, we present here an Eclipse module for TFLUX. Programmability is a major challenge for future systems, as users need to adopt new models so as to fully exploit the potential of such systems. The users that wish to program using the...
22. ...host machine. Fig. 13 shows the trend as we increase the number of virtual nodes. As expected, the main-memory consumption and the CPU utilization on the host increase. We managed to simulate 220 nodes of 32 cores (7040 cores in total), using 92% of the main memory and 93% of the host CPU. This demonstrates the ability of the proposed simulation framework to scale the simulations to the 1-kilo-core range and beyond (up to 7 kilo-cores were tested).

3.2 PIKE: Automating Large Simulations (UNISI)

Many of the steps necessary to set up a COTSon simulation require knowledge of many details, which slows down the learning curve of using our simulation platform. Therefore, UNISI decided that a good way to improve the knowledge transfer would be to provide an additional tool to ease this process; this tool is called PIKE. COTSon is a full-system simulation infrastructure developed by HP Labs to model complete computing systems, ranging from multicore nodes up to clusters with complete network simulation. A single simulation requires the configuration of various parameters, by editing a configuration file written in th...
23. ...precedence between two threads is highlighted with arrows, such that the source of each arrow is the scheduling thread and the arrow points to the scheduled thread. [Fig. 24 diagram: threads main, main_ep_1, move_to_next_el, mul_thread, mul_thread_next_el, calc_curr_el, mul_thread_end, and join_threads, with edge labels such as "current block", "np", "current elem", and "last".]
Fig. 24: Dataflow graph for the blocked Matrix Multiplication algorithm.

The main thread is responsible for reading the input values from the command line and for unconditionally scheduling two threads: join_threads, with a synchronization count of np+2, and the main_ep_1 thread, with a synchronization count of 10. The join_threads thread is the very last thread of the algorithm: it has to wait for 2 data items to be written in its frame memory (that is, the pointer to the matrix C and its size), plus np fake values (one for each partition of the result matrix) written at offset zero of the frame once each sub-block has been calculated. This mechanism allows join_threads to synchronize its execution: it will run only when all the sub-blocks are ready, because its synchronization count will be reduced to zero by the last TWRITE operation for the fake value at offset zero. The main_ep_1 thread receives in its frame memory, from the main thread, all the information needed for execution (e.g., the memory pointers for matrices A, B, and C, the size of the matrices and of each sub-block, etc.); it is responsible for unconditionally scheduling the mul_thread, which is...
24. ...simulation server is available to all the TERAFLUX partners. Finally, tutorial sessions on OmpSs have been organized by BSC; these tutorials were open to all the TERAFLUX partners.

1 Introduction

The main objective of workpackage WP7 is to drive the integration of the research performed by each TERAFLUX partner. This is done mainly by means of a common simulation infrastructure, the COTSon simulator, which can be modified by partners in order to meet their research needs while transferring the reciprocal knowledge to the other partners. In this report we provide a summary of the activities performed by the TERAFLUX Consortium during the third year of the project, working on the common evaluation platform (see section 2.1 for an introduction to this concept). As the content of this Deliverable shows, the knowledge transfer about the simulation infrastructure to the TERAFLUX Partners has been very successful. The T* instructions have been introduced as an extension of the x86-64 ISA, as designed in D7.2, and are now integrated in the simulator; we provide a high-level description of the fundamental mechanisms in section 2.2. Since...
25. ...the on-line approach: McPAT is run during the simulation, right after a heartbeat has been produced, while the program is still running. Fig. 7 shows the current tool chain used to estimate power consumption with off-line/on-line processing, and some first sample output:

HEARTBEATS                    1-5            6-10           11-15          16-20
Core clock                    3000 MHz       3000 MHz       3000 MHz       3000 MHz
Cycles                        518001021 cc   741501468 cc   988501962 cc   742001469 cc
Time                          172.667 msec   247.167 msec   329.501 msec   247.334 msec
CPU <cpu0>:
  Subthreshold Leakage power  2.39913 W      2.39913 W      2.39913 W      2.39913 W
  Gate Leakage power          0.0054596 W    0.0054596 W    0.0054596 W    0.0054596 W
  Total Leakage power         2.4045896 W    2.4045896 W    2.4045896 W    2.4045896 W
  Runtime Dynamic power       0.269286 W     0.269339 W     0.269316 W     0.269265 W
  Total power                 2.6738756 W    2.6739286 W    2.6739056 W    2.6738546 W
Core clock                    3000 MHz       3000 MHz       3000 MHz       3000 MHz
Cycles                        518001021 cc   741501468 cc   988501962 cc   742001469 cc
Time                          172.667 msec   247.167 msec   329.501 msec   247.334 msec
CPU <cpu1>:
  Subthreshold Leakage power  2.39913 W      2.39913 W      2.39913 W      2.39913 W
  Gate Leakage power          0.0054596 W    0.0054596 W    0.0054596 W    0.0054596 W
  Total Leakage power         2.4045896 W    2.4045896 W    2.4045896 W    2.4045896 W
  Runtime Dynamic power       0.268092 W     0.268093 W     0.268093 W     0.268...
26. unconditionally, with the address in register ip, to be executed. Immediate sc holds the synchronization count (0-255). It returns a thread handle in register tid. The tid is guaranteed to have bits 0-31 at 0 (see TWRITE). Constraint: tid and ip must specify the same register identifier, i.e., the same x86-64 register. For a variable sc, or sc > 255, the general version TSCHEDULE is required.
• tschedulep tid, ip, sccnd: Schedules a thread conditionally, with the address in register ip, to be executed. Register sccnd packs sc (sync count) and cnd (predicate) as sccnd = (sc << 1) | cnd. It returns a thread handle in register tid, or 0 if the predicate is false. The tid is guaranteed to have bits 0-31 at 0 (see TWRITE). This is the general form, used with a variable sc or sc > 255. For the immediate version (sc < 256), tschedulepi is more efficient.
• tschedule tid, ip, sc: Schedules the thread unconditionally, with the start address in register ip. Register sc contains the synchronization count. tschedule returns a thread handle in register tid. By design, we decided to use thread handles expressed on 32 bits; moreover, for efficiency reasons, we store such handles in the 32 most significant bits of tid. In this way we can do standard address arithmetic on thread handles (e.g., add an offset to obtain the address of an individual element of the thread frame) almost as if they were addresses. This is the general form
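The two packing conventions just described (the sccnd operand of tschedulep, and thread handles kept in the upper 32 bits of tid so that frame slots can be addressed with plain additions) can be illustrated with a small C sketch. The helper names and the 8-byte slot size are our assumptions for illustration, not part of the ISA definition.

```c
#include <stdint.h>

/* Pack a variable synchronization count and a predicate for
   tschedulep: sccnd = (sc << 1) | cnd (names are illustrative). */
static uint64_t pack_sccnd(uint64_t sc, int cnd) {
    return (sc << 1) | (uint64_t)(cnd ? 1 : 0);
}

/* A thread handle occupies bits 32..63 of tid, and bits 0..31 are
   guaranteed to be 0, so the "address" of a frame slot can be formed
   by ordinary addition (assumed 8-byte slots, for illustration). */
static uint64_t frame_slot(uint64_t tid, uint64_t slot) {
    return tid + slot * 8;
}
```

For example, pack_sccnd(300, 1) yields 601 (sync count 300 with a true predicate), and a handle in the upper 32 bits of tid plus an 8-byte slot offset gives a directly usable frame-slot address.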
27. very high number of cores (7000 cores in recent tests). In order to achieve this goal we need a powerful simulation system. Currently, we use as host machine an HP ProLiant DL585 G7, based on the AMD Opteron 6200 Series [TFX3], which provides 64 cores coupled to 1 TB of shared main memory (DRAM). There is a trade-off between the complexity of the guest machine and the time required by the simulation: higher complexity in the guest machine (number of simulated cores, memory, etc.) produces longer simulations. A good trade-off is to use one host core for each functional instance, i.e., a functional instance representing a node in the simulated chip architecture. Each node can have up to 32 cores, but we found out that 16 x86-64 cores per node scale better in terms of execution time. Hence, the simulation of a thousand-core system can be achieved by distributing the simulation to more than one host. However, since we want to focus on the simulation of a 1K-core system, considering a single host machine is sufficient. In order to correctly simulate a kilo-core architecture, we booted up 64 virtual nodes, each one contain
28. www.teraflux.eu web site and can be distributed to the Public. The list of authors does not imply any claim of ownership on the Intellectual Properties described in this document. The authors and the publishers make no expressed or implied warranty of any kind, and assume no responsibilities for errors or omissions. No liability is assumed for incidental or consequential damages in connection with, or arising out of, the use of the information contained in this document. This document is furnished under the terms of the TERAFLUX License Agreement (the "License") and may only be used or copied in accordance with the terms of the License. The information in this document is a work in progress, jointly developed by the members of the TERAFLUX Consortium ("TERAFLUX"), and is provided for informational use only. The technology disclosed herein may be protected by one or more patents, copyrights, trademarks and/or trade secrets owned by or licensed to TERAFLUX Partners. The partners reserve all rights with respect to such technology and related materials. Any use of the protected technology and related material beyond the terms of the License, without the prior written consent of TERAFLUX, is prohibited. This document contains material that is confidential to TERAFLUX and its members and licensors. Until publication, the user should assume that all materials contained and/or referenced in this document are confidential and proprietary, unless otherwise indicated or apparent fro
29. Scalable Transactional Memory mechanisms have been built on top of these protocols. This timing support is separate from the functional simulation in the SimNow analyzer module described above, but needs to be used in conjunction with it. On the level of timing simulation, the TM systems supported are lazy-lazy implementations. Further details are described in Deliverable 6.3. As with the functional module, TM timing support is available in a branch on SourceForge. This includes the TM models themselves (distributed directory-based cache coherence, network simulation) and the scripts, documentation and tests to go with these. Similar to SimNow, this work exposed several bugs in COTSon, for which we have contributed fixes in conjunction with our partners at HP.

3 Development and Simulation environment and supports

The adoption of architectural simulators has become essential for assuring the correctness of any design. Architectural simulators historically suffered from low simulation speed and accuracy, imposing serious limitations on the ability of predicting correct behaviors of the designed architecture [Portero12, Gior
30. 092 W
 Total power                  2.6726816 W    2.6726826 W    2.6726826 W    2.6726816 W
Core clock                    3000 MHz       3000 MHz       3000 MHz       3000 MHz
Cycles                        518001021 cc   741501468 cc   988501962 cc   742001469 cc
Time                          172.667 msec   247.167 msec   329.501 msec   247.334 msec
CPU <cpu2>
 Subthreshold Leakage power   2.39913 W      2.39913 W      2.39913 W      2.39913 W
 Gate Leakage power           0.0054596 W    0.0054596 W    0.0054596 W    0.0054596 W
 Total Leakage power          2.4045896 W    2.4045896 W    2.4045896 W    2.4045896 W
 Runtime Dynamic power        0.268092 W     0.268093 W     0.268093 W     0.268092 W
 Total power                  2.6726816 W    2.6726826 W    2.6726826 W    2.6726816 W
Core clock                    3000 MHz       3000 MHz       3000 MHz       3000 MHz
Cycles                        518001021 cc   741501468 cc   988501962 cc   742001469 cc
Time                          172.667 msec   247.167 msec   329.501 msec   247.334 msec
CPU <cpu3>
 Subthreshold Leakage power   2.39913 W      2.39913 W      2.39913 W      2.39913 W
 Gate Leakage power           0.0054596 W    0.0054596 W    0.0054596 W    0.0054596 W
 Total Leakage power          2.4045896 W    2.4045896 W    2.4045896 W    2.4045896 W
 Runtime Dynamic power        0.268092 W     0.268093 W     0.268093 W     0.268092 W
 Total power                  2.6726816 W    2.6726826 W    2.6726826 W    2.6726816 W
CPU total power
 Dynamic                      1.073562 W     1.073618 W     1.073595 W     1.073541 W
 Leakage                      9.6183584 W    9.6183584 W    9.6183584 W
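The totals in the sample output above are simply the sum of the leakage and runtime dynamic components reported by McPAT; a quick check with the cpu0 numbers of the first heartbeat interval (the helper name is ours, for illustration):

```c
/* Total power = subthreshold leakage + gate leakage + runtime dynamic,
   as reported per heartbeat in the McPAT sample output. */
static double total_power(double subthreshold_w, double gate_w, double dynamic_w) {
    return subthreshold_w + gate_w + dynamic_w;
}
```

With the first-interval cpu0 values, total_power(2.39913, 0.0054596, 0.269286) gives 2.6738756 W, matching the reported total.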
31. 2.3.1 Matrix Multiplier ................................................ 16
2.3.2 Other Benchmarks ................................................. 16
2.4 SINGLE NODE T TESTS (UNISI) ........................................ 17
2.4.1 T Timing Model ................................................... 18
2.5 MULTI NODE T TESTS (UNISI) ......................................... 19
2.5.1 Framework design ................................................. 20
2.5.2 Demonstration of multi node capability of the new distributed scheduler ... 21
2.6 POWER ESTIMATION USING MCPAT (UNISI) ............................... 21
2.6.1 Off line vs on line Power estimation ............................. 22
2.7 EXECUTION OF USER LEVEL DDM ON COTSON (UCY) ........................ 24
2.8 INTEGRATING DDM TSU INTO COTSON (UCY) .............................. 25
2.9 GCC BACKEND AND OPENSTREAM EXPERIMENTS ON COTSON (INRIA) ........... 25
2.10 DOUBLE EXECUTION AND THREAD RESTART RECOVERY IN A SINGLE NODE (COTSON MODULES) ... 27
2.10.1 EDU subsystem in COTSon ......................................... 27
2.10.2 Double execution and Recovery Support ........................... 27
2.11 HIGH LEVEL FA
32. 6.1 User's Manual, November 2009. Available at: http://developer.amd.com/tools/cpu-development/simnow-simulator
[TFX3] http://h10010.www1.hp.com/wwpc/us/en/sm/WF06a/1535-1-15351-3328412-241644-3328422-4194641.html?dnr=1
[x86] Intel 64 and IA-32 Architectures Software Developer's Manual, Vol. 2: Instruction Set Manual, March 2010.

Appendix A

The Matrix Multiplication developed at UNISI needs, as input parameters, the number s of rows and columns of the square matrices A and B, and np, the number of partitions of the result matrix to which the multiplication algorithm is recursively applied. After a first construction phase for the A and B matrices, which are composed of random integer elements, the algorithm allocates the C matrix of the same size, and then partitions C into np sub-blocks. At this point, the multiplication algorithm is applied to each sub-block. Fig. 24 shows the structure of the dataflow version of the algorithm, implemented using the T extension to the x86-64 ISA (referring to its implementation introduced in Section 2.1). Each DF-Thread of the algorithm is represented with a circle
33. COTSon setup did not turn out successful, due to problems with the data communication support across multiple COTSon nodes. We developed a small benchmark program for the communications layer, and we were able to identify that COTSon did not progress when the user sends messages larger than 2 KB. To overcome this issue, we developed an intermediate communication layer in the TSU network unit that accepts messages of any size and splits them into smaller packets, achieving successful communication. As shown in Fig. 8, we have managed to successfully execute a DDM application using 4 nodes on COTSon with the user-level TSU. This configuration was compiled on the tfx2 machine, i.e., one of the simulation hosts provided by UNISI, with 48 cores and 256 GB of shared memory.
[Screenshot: four SimNow "Numeric Display" windows, one per node, each reporting simulator statistics such as host seconds, simulated seconds and average MIPS.]
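The splitting logic of that intermediate layer can be sketched as follows. The 2 KB constant comes from the observed COTSon limit, while the function and callback names are our assumptions for illustration; the actual TSU network-unit code is not reproduced here.

```c
#include <stddef.h>

#define MAX_PKT 2048  /* COTSon stalled on messages larger than 2 KB */

/* Chop a message into packets of at most MAX_PKT bytes, handing each
   to a caller-supplied send callback; returns the packet count. */
static size_t split_and_send(const unsigned char *msg, size_t len,
                             void (*send_packet)(const unsigned char *, size_t)) {
    size_t sent = 0, packets = 0;
    while (sent < len) {
        size_t chunk = len - sent;
        if (chunk > MAX_PKT)
            chunk = MAX_PKT;
        send_packet(msg + sent, chunk);
        sent += chunk;
        packets++;
    }
    return packets;
}

/* Demo helper: count how many packets a message of a given length produces. */
static void discard(const unsigned char *p, size_t n) { (void)p; (void)n; }
static size_t demo_packets(size_t len) {
    static unsigned char buf[8192];
    return len <= sizeof buf ? split_and_send(buf, len, discard) : 0;
}
```

A 5000-byte message, for instance, goes out as three packets (2048 + 2048 + 904 bytes).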
34. D(4);               /* start index */
    r55 = DF_TREAD(5);  /* log size */
    r5  = DF_TREAD(6);  /* size */
    r58 = DF_TREAD(7);  /* pointer to the FM of thread join_threads */
    r7  = DF_TREAD(8);  /* the end index for this part */
    uint64_t r32 = (r10 >> r55) * r5;  /* index for A */
    uint64_t r30 = r10 - r32;          /* index for B */
    r26 = 0;  /* needed for calculating current element: sum */
    r34 = 0;  /* needed for calculating current element: counter */
    uint64_t r44 = DF_TSCHEDULE(1, calc_curr_el, 12);
    DF_TWRITE(r2,  r44, 1);   /* A */
    DF_TWRITE(r3,  r44, 2);   /* B */
    DF_TWRITE(r4,  r44, 3);   /* C */
    DF_TWRITE(r32, r44, 4);   /* index for A */
    DF_TWRITE(r30, r44, 5);   /* index for B */
    DF_TWRITE(r5,  r44, 6);   /* size */
    DF_TWRITE(r26, r44, 7);   /* sum */
    DF_TWRITE(r34, r44, 8);   /* counter */
    DF_TWRITE(r10, r44, 9);   /* start index */
    DF_TWRITE(r58, r44, 10);  /* pointer to the FM of thread join_threads */
    DF_TWRITE(r7,  r44, 11);  /* the end index for this part */
    DF_TWRITE(r55, r44, 12);  /* log size */
    DF_TDESTROY();

void calc_curr_el(void)
{
    DF_TLOAD(12);
    uint64_t r2  = DF_TREAD(1);   /* A */
    uint64_t r3  = DF_TREAD(2);   /* B */
    uint64_t r4  = DF_TREAD(3);   /* C */
    uint64_t r32 = DF_TREAD(4);   /* index for A */
    uint64_t r30 = DF_TREAD(5);   /* index for B */
    uint64_t r5  = DF_TREAD(6);   /* size */
    uint64_t r26 = DF_TREAD(7);   /* sum */
    uint64_t r34 = DF_TREAD(8);   /* counter */
    uint64_t r10 = DF_TREAD(9);   /* start index */
    uint64_t r58 = DF_TREAD(10);  /* pointer to the FM of thread join_threads */
    uint64_t r7  = DF_TREAD(11);  /* the end index for this part */
    uint64_t r55 = DF_TREAD(12);  /* log size */
    uint64_t *A, *B;
    A = (uint64_t *)r2;  /* r2 contains th
35. D PARTY RIGHTS. TERAFLUX SHALL NOT BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL OR CONSEQUENTIAL DAMAGES OF ANY KIND OR NATURE WHATSOEVER (INCLUDING, WITHOUT LIMITATION, ANY DAMAGES ARISING FROM LOSS OF USE OR LOST BUSINESS, REVENUE, PROFITS, DATA OR GOODWILL) ARISING IN CONNECTION WITH ANY INFRINGEMENT CLAIMS BY THIRD PARTIES OR THE SPECIFICATION, WHETHER IN AN ACTION IN CONTRACT, TORT, STRICT LIABILITY, NEGLIGENCE, OR ANY OTHER THEORY, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

Executive Summary

In this report we provide a description of the integration activity, through the COTSon simulation platform, of the research of the TERAFLUX partners, as progressed during the third year of the project. Thanks to the common simulator tools and internal dissemination, partners have also been able to transfer their respective research knowledge to the other partners. The support for T instructions has been implemented in the simulator; this means that partners are now able to run actual benchmarks containing the dataflow Instruction Set Extension (T ISE) designed in the previous period of the project. The Thread Scheduling
36. Fig. 3: Fibonacci(35), number of threads (zoomed detail of the previous Figure).

2.4.1 T Timing Model

Currently, the TSU implementation already provides functional execution for all T instructions. In this section we describe the implementation efforts for the timing model within the simulator, which assumes the baseline architecture described in D6.3 for the TERAFLUX DTS (Distributed Thread Scheduler). For explaining the current methodology, we assume the existence of a component still under research in the Architecture workpackage, which is the DF-Frame cache. The implementation of the timing model is organized as shown in Fig. 4. The execution flow is managed as follows:
• During the execution, T instructions and memory accesses are dropped from SimNow into COTSon.
• The Filter component filters all T memory accesses to DF-Frames by passing them to the TSU, in order to model the DF-Frame cache, the DF-Frame memory, and all queue structures. All other instructions, i.e., the ones that are part of the regular x86-64 ISA, are passed directly to the COTSon Timer, which already implements the timing model for non-T instructions.
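The routing performed by the Filter can be sketched as below; the predicate and function names, the single contiguous DF-Frame region, and the enum are simplifying assumptions for illustration, not the actual COTSon interface.

```c
#include <stdbool.h>
#include <stdint.h>

typedef enum { TO_TSU, TO_TIMER } route_t;

/* Assumed for simplicity: DF-Frames live in one contiguous region. */
static bool in_df_frame(uint64_t addr, uint64_t base, uint64_t size) {
    return addr >= base && addr - base < size;
}

/* T instructions and DF-Frame accesses go to the TSU model; every
   regular x86-64 access goes to the ordinary COTSon timer. */
static route_t filter(bool is_t_insn, uint64_t addr,
                      uint64_t frame_base, uint64_t frame_size) {
    return (is_t_insn || in_df_frame(addr, frame_base, frame_size))
               ? TO_TSU : TO_TIMER;
}
```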
37. LUX evaluation platform. The APPS block represents the applications that researchers can feed to the evaluation platform, as well as other pipe-cleaner benchmarks like the ones described in Section 2.3 of this document and the ones coming from the activities of WP2. Another important point emerged from WP2: a proper choice of the inputs, in order to be able to show the performance at the TERADEVICE level, i.e., for at least 1000 complex cores, as discussed in previous deliverables like D7.1, D7.2, D7.3 (i.e., 1000 x 10^9-transistor devices). The TERAFLUX evaluation platform is the set of common tools available to partners: the extended simulator, i.e., the extended COTSon (see Sections 2.2, 2.4, 2.8, 2.10 and 2.11); compilers (see Section 2.9); the hardware for hosting simulations (see Section 3.1); and external tools for power estimation (see Section 2.6) or to easily configure and run the simulator (see Section 3.2). The output block represents the outcome of the benchmarks, while the performance metrics are the set of statistics that can be obtained when executing benchmarks in the common platform (see Sections 2.4 and 2.5). Finally, in this context, the app output is necessary for verifying that the application executed correctly during the evaluation.
38. NG ON A FOUR CPU MACHINE ............................................ 25
FIG. 10: PERFORMANCE DEGRADATION OF FIBONACCI 40 USING THREAD FAILURE INJECTION WITH FAILURE RATES PER CORE OF 10^-5 AND ... ... 29
FIG. 11: EXTERIOR VISION OF THE DL PROLIANT DL585, MAIN TERAFLUX SIMULATION SERVER
FIG. 12: HOST VERSUS VIRTUAL SYSTEM .................................... 33
FIG. 13: NUMBER OF VIRTUAL CORES VS MEMORY UTILIZATION IN HP PROLIANT DL585 G7 SERVER (1 TB MEMORY, 64 x86-64 CORES) ... 33
FIG. 14: EXECUTING PIKE IN SILENT MODE ................................. 34
FIG. 15: EXECUTING PIKE IN VERBOSE MODE ................................ 35
FIG. 16: SIMNOW INSTANCE WITH TEST EXAMPLE (SINGLE SIMULATION) ......... 37
FIG. 17: TWO SIMNOW WINDOWS IN CASE OF MULTIPLE SIMULATION PIKE RUN .... 38
FIG. 18: THE CONTENT ASSISTANT PLUG-IN LISTING THE AVAILABLE DDM KEYWORDS ... 39
FIG. 19: THE CONTENT ASSISTANT PLUG-IN FILTERING THE DDM KEYWORDS STARTING WITH DVM FOR THE SCHEDULING POLICY FIELD OF THE THREAD PRAGMA ... 40
FIG. 20: THE SIDE PANEL PLUG-IN IMPORTED TO THE ECLIPSE PLATFORM ....... 40
FIG. 21: THE SIDE PANEL PLUG-IN SHOWING A DROP-DOWN LIST FOR THE OPTIONS OF THE SCHEDULING MODE ... 41
FIG. 22
39. Project TERAFLUX - Exploiting dataflow parallelism in Teradevice Computing. Grant Agreement Number: 249013. Call: FET proactive 1: Concurrent Tera-device Computing (ICT-2009.8.1)

SEVENTH FRAMEWORK PROGRAMME
THEME: FET proactive 1: Concurrent Tera-Device Computing (ICT-2009.8.1)
PROJECT NUMBER: 249013
TERAFLUX - Exploiting dataflow parallelism in Teradevice Computing

D7.4 - Report on knowledge transfer and training

Due date of deliverable: 31 December 2012
Actual submission: 20 December 2012
Start date of the project: January 1, 2010
Duration: 48 months
Lead contractor for the deliverable: UNISI
Revision: see file name in document footer

Project co-funded by the European Commission within the SEVENTH FRAMEWORK PROGRAMME (2007-2013)
Dissemination Level: PU
PU: Public
PP: Restricted to other programs participants (including the Commission Services)
RE: Restricted to a group specified by the consortium (including the Commission Services)
CO: Confidential, only for members of the consortium (including the Commission Services)

Change Control
Version  Author          Organization  Change History
0.1      Marco Solinas   UNISI         Initial template
1.0      Marco Solinas   UNISI         UNISI parts
1.2      Marco Solinas   UNISI         Added contributions from partners
2.1      Roberto Giorgi  UNISI         Final revision
3.0      Marco Solinas   UNISI         Executive Summary and Introduction
40. THE SIDE PANEL PLUG-IN AUTOMATICALLY CLOSING THE DDM PRAGMAS ....... 41
FIG. 23: THE SIDE PANEL PLUG-IN SHOWING THE PROPERTIES OF A SELECTED PRAGMA ... 42
FIG. 24: DATAFLOW GRAPH FOR THE BLOCKED MATRIX MULTIPLICATION ALGORITHM ... 45

Glossary
Auxiliary Core: a core typically used to help the computation; any other core than service cores (also referred to as TERAFLUX core)
BSD: BroadSword Document; in this context, a file that contains the SimNow machine description for a given Virtual Machine
CDG: Codelet Graph
CLUSTER: group of cores, synonymous with NODE
Codelet: set of instructions
COTSon: software framework provided under the MIT license by HP Labs
DDM: Data-Driven Multithreading
DF-Thread: a TERAFLUX DataFlow Thread
DF-Frame: the Frame memory associated to a DataFlow thread
DVFS: Dynamic Voltage and Frequency Scaling
DTA: Decoupled Threaded Architecture
DTS: Distributed Thread Sched
41. TREAD(5);                /* size of the result matrix C */
    tt = df_tstamp(ts1) - tt;  /* END TIMING */
    DF_TDESTROY();
    df_exit();
    free((uint64_t *)r4);
42. Torrie, E., Pal Singh, J., and Gupta, A. The SPLASH-2 programs: characterization and methodological considerations. In Proc. of the 22nd Annual International Symposium on Computer Architecture (ISCA '95), ACM, New York, NY, USA, pp. 24-36.
[COTSon09] Argollo, E., Falcón, A., Faraboschi, P., Monchiero, M., Ortega, D. COTSon: infrastructure for full system simulation. ACM SIGOPS Operating Systems Review, vol. 43, January 2009, pp. 52-61.
[D72] Giorgi, R., et al. D7.2 - Definition of ISA extensions, custom devices and External COTSon API extensions.
[Giorgi96] Giorgi, R., Prete, C.A., Prina, G., Ricciardi, L. A Hybrid Approach to Trace Generation for Performance Evaluation of Shared-Bus Multiprocessors. IEEE Proc. 22nd EuroMicro Int'l Conf. (EM '96), ISBN 0-8186-7487-3, Prague, Czech Republic, Sept. 1996, pp. 207-214.
[Giorgi97] Giorgi, R., Prete, C.A., Prina, G., Ricciardi, L. Trace Factory: Generating Workloads for Trace-Driven Simulation of Shared-Bus Multiprocessors. IEEE Concurrency, ISSN 1092-3063, Los Alamitos, CA, USA, vol. 5, no. 4, Oct. 1997, pp. 54-68. doi:10.1109/4434.641627
[Giorgi07] Giorgi, R., Popovic, Z., Puzovic, N. DTA-C: A Decoupled multi-Threaded Architecture for CMP Systems. Proc. IEEE SBAC-PAD, Gramado, Brazil, Oct. 2007, pp. 263-270.
[Giorgi12] Giorgi, R., Scionti, A., Portero, A., Faraboschi, P. Architectural Simulation in the Kilo-core Era. Architectural Support for Programming Languages and Operating Systems (ASPLOS 2012), poster pres
43. ULT INJECTION TECHNIQUE (COTSON MODULES) ............................ 28
2.12 TRANSACTIONAL MEMORY SUPPORT IN COTSON (UNIMAN) ................... 30
2.12.1 Functional Transactional Memory ................................. 30
2.12.2 Adding timing support with COTSon ............................... 30
3 DEVELOPMENT AND SIMULATION ENVIRONMENT AND SUPPORTS .................. 32
3.1 THE TFX3 TERAFLUX SIMULATION SERVER ................................ 32
3.2 PIKE: AUTOMATIZING LARGE SIMULATIONS (UNISI) ....................... 34
3.2.1 Overall Organization ............................................. 34
3.2.2 Functions Exposed to the User .................................... 35
3.2.3 Current limits ................................................... 36
3.2.4 Examples ......................................................... 36
3.3 THE ECLIPSE MODULE FOR TFLUX ....................................... 39
3.3.1 The Content Assistant Plug-in .................................... 39
3.3.2 The Side Panel Plug-in ........................................... 40
3.4 SUPPORT TO THE PARTNERS FOR IMPLEMENTING COTSON EXTENSIONS ......... 43
44. Unit provides full support for the execution of TSCHEDULE, TDESTROY, TREAD and TWRITE; variants of these basic instructions are also implemented in the simulator, in order to meet some compiler needs highlighted by the partners working on WP4. An interface for injecting such T built-ins directly in C applications is also available, and in this report we provide the description of some first kernel benchmarks (i.e., the Recursive Fibonacci and Matrix Multiply) exploiting this feature. The support of the GCC compiler for generating executable T binaries directly from OpenStream-annotated C code is also available to partners, and ready-to-compile applications are also published in the public repository. Finally, the support for multimode Transactional Memory is implemented in the simulator, available to all the Partners, and publicly available for download and run. We believe that all the above will enhance the capability of the research community to simulate Teradevice systems. The multi-node Distributed Thread Scheduler (DTS), a key element of the TERAFLUX Architecture, has also been implemented in COTSon, and is also publicly available for downloading and running experiments. In this report we show how the very same T application binaries running on the single-node configuration have also been successfully run in a multi-node system. This implementation of the multi-node DTS currently encompasses the functional implementation and a partial timing
45. ach with the respective log and output files stored in the PIKE log directory, and identified by an alphanumeric code. These simulations use physical cores on the host machine through a thread-pool mechanism.
[Screenshot: a SimNow window showing simulator statistics and the guest console log (network setup, SSH restart, and the test script printing its parameter).]
46. achine architecture description. Hence, for each heartbeat, the McPAT tool extracts a power consumption estimation. As shown, in the case of the on-line power estimation the set of power estimation values is stored back in the database. This allows the TSU to properly schedule the DF-Threads in order to respect the power, temperature and reliability constraints (see also Section 2.10 in this deliverable and Deliverable D5.3) and their correlation with power consumption. Similarly to the off-line approach, the XML configuration file is generated by the McPAT configuration generator script at every heartbeat. Finally, in this case, the same set of power consumption values can be used to respect the power profile of the simulated machine.

2.7 Execution of User-Level DDM on COTSon (UCY)

Within the context of WP7, we have been working on the execution of DDM applications using our user-level DDM TSU runtime. With our first implementation, reported earlier, we were able to execute on single-node COTSon instances. Within this year we have extended the TSU to support execution on distributed systems. Our first attempt to execute on a multi-node
47. achine must be increased. In particular, if we want to simulate a system with, say, 1024 cores (the target for this project, as presented in previous deliverables and in particular in D7.2), we may need at least 32 nodes. Extending COTSon in order to allow many-node simulations with T support has been performed by UNISI with the support of HP. Currently, the TSU model is able to perform thread scheduling among many nodes. It has to be tuned up by connecting the timing models of the several existing components (like caches and memory) to the TSU models. We plan to complete the multi-node case in the next year. In the following, we provide some insights on the framework and show preliminary results of Fibonacci and Matrix Multiplier running on target machines with up to 1024 cores.
[Figure: per-node structure of the framework, showing the DF-Frame information in shared memory, the circular queue and ready queue feeding the scheduler, the message queues, the DF-Frame memory allocation/deallocation manager, and the monitor.]
Fig. 5: The structure of the framework for multi-node simulation, as it is running on our simulation host.
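The sizing above follows from simple division: with 32-core nodes, 1024 cores need at least 32 nodes, and with the 16-core-per-node configuration preferred for simulation speed, 64 nodes. A small illustrative helper (the function name is ours, not part of COTSon):

```c
/* Nodes needed to reach a target core count, rounding up
   (illustrative helper; the numbers come from the text). */
static int nodes_needed(int target_cores, int cores_per_node) {
    return (target_cores + cores_per_node - 1) / cores_per_node;
}
```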
48. and run it. Another limitation of the current version of PIKE is the impossibility to redirect and control the benchmark output file (if any), for example to copy it from guest to host. PIKE uses the most recent version of COTSon to work. If the COTSon installation directory is present neither in the configuration file nor in the WorkingDirectory, PIKE will download and install it in a specific folder (WorkingDirectory). This technique allows having a number of independent working environments. PIKE is strongly coupled to COTSon: it is a wrapper of the simulator. Consequently, if the simulator has bugs, PIKE automatically inherits them.

3.2.4 Examples

In the PIKE installation folder there is also an example of the configuration file, called pike_example.conf. Running this example uses the script binary_test.sh, which prints to standard output a given parameter, always specified in the configuration file. Fig. 16 shows the SimNow console running this test example for a single simulation. It is possible to use this example to test a single node. If the user wants to run more sophisticated multi-node simulations, like using MPI or other multi-node simulation options, he/she may use custom SimNow HD images, like debian.img.
49. and sets the no-operation function in the register, and a reserved region of memory in register pstack. The no-operation function is used to optimize the simulation idle polling loop. The reserved stack is used to materialize the local frame by tload (see above), so that it can be used by standard x86 load and store operations by the compiler. Used in the run-time and not in the dataflow programs.
• treset rs, rn: resets the dataflow execution, freeing all threads and preparing for a new execution. The register rs points to a string in memory, of length stored in register rn, for simulation debugging purposes.
And finally, this instruction is used for debugging and tracing of execution statistics:
• tstamp ts, buf: collects per-core stats (instr, cycles, idles) in the memory pointed to by register buf, and returns the current value of simulation nanos in register ts. It can be used to address execution statistics in a much more precise way than using performance counters from a guest program.

2.3 New T Benchmarks (UNISI)

By exploiting the T ISE support for the C language introduced in Section 2.1, new benchmar
50. ber 249013, Call: FET proactive 1: Concurrent Tera-device Computing (ICT-2009.8.1).
[Screenshot: the AMD SimNow main window (public release) on host tfxa.promana.unisi.it, showing simulator statistics and the guest console printing the test script parameter.]
Fig. 16: SimNow instance with test example (single simulation).
Fig. 17 shows two SimNow windows that are opened when PIKE is executed with the same binary file binary_test.sh, using two different applications customized per node (simulation_binary1 and simulation_binary2). Two different simulations are running, e
[Fig. 8: Running DDM on COTSon with four nodes - console output of the four simulated nodes, showing the DDM phases (Get my ID, Sync, Execution, Data Verified) and the timing breakdown (PSerial, PTotal, Idle, Execution).]

[Screenshot: AMD SimNow main window (Public Release) on host tfx2, with the simulator statistics of the running node.]
can be specified in the PIKE configuration file. The list of the main sections of the configuration file is reported in Table 4.

Table 4: Structure of the PIKE configuration file
- system: allows us to specify a custom path for PIKE and for the directories listed in Table 1; appropriate links to any SimNow ISO images will be automatically created in the COTSon data directory.
- log: allows us to specify the output directory of the logs produced by the simulation, together with the names for the output files if needed. If such names are not customized, PIKE creates log files using an alphanumeric code as the simulation's identifier.
- custom: Hard Disk image file (if any), number of nodes, number of cores and the size of the RAM.
- software: software packages to be installed on the guest before running the simulation. The COTSon mediator is used to provide the Ethernet-based connection among the simulated nodes. PIKE supports both deb- and rpm-based packages.
- parameters: the benchmarks to run and their parameters. For each entry a different simulation will be launched; each run will be identified by a different alphanumeric code.
- cache / mediator: cache configuration and mediator configuration inside the simulation.

3.2.3 Current limits
PIKE currently does not allow complete control over the timing options of the simulations. It does not allow the execution of overly complex benchmarks, such as those that need an ad hoc installation process rather than the loading of a single executable binary.
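As an illustration, a configuration covering the sections of Table 4 could look like the following sketch. The section names come from the table; every key, every value and the syntax itself are hypothetical assumptions made here for illustration, not PIKE's actual file format.

```
; Hypothetical PIKE configuration sketch: section names follow Table 4,
; but all keys, values and this syntax are illustrative assumptions.
[system]
pike_path = /opt/pike          ; custom PIKE path (directories of Table 1)

[log]
output_dir = /home/user/logs   ; omit to get auto-generated alphanumeric ids

[custom]
nodes = 4                      ; number of simulated nodes
cores = 8                      ; cores per node
ram   = 2048                   ; guest RAM size (MB)

[software]
packages = my_benchmark.deb    ; deb/rpm packages installed on the guest

[parameters]
run = fibonacci 40             ; one simulation is launched per entry
```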
3.5 Tutorial Sessions on OmpSs Open to the Partners (BSC) ............ 43
............................................................................. 45

List of Figures
Fig. 1: TERAFLUX evaluation platform ............ 10
Fig. 2: Fibonacci (35): number of threads in four single-node configurations ............ 17
Fig. 3: Fibonacci (35): number of threads, zoomed detail of the previous figure ............ 18
Fig. 4: Timing model for the T* execution ............ 19
Fig. 5: The structure of the framework for multi-node simulation as it is running on our simulation host ............ 20
Fig. 6: Multi-node simulation: Fibonacci with input set to 40, and matrix multiply with matrix size 512x512 partitioned in a number of blocks equal to the number of cores ............ 21
Fig. 7: Power estimation sample outputs ............ 23
Fig. 8: Running DDM on COTSon with four nodes ............ 24
Fig. 9: Blocked matrix multiply runni
Additionally, partner HP extended the baseline TSU implementation to support speculative thread creation. Speculative thread creation means that dthread objects created by a potentially faulty thread are tagged as speculative and can be discarded when the FDU detects a fault in the parent thread. To enable the elimination of speculative threads in case of a fault, each speculative thread stores the parent ID of its creator in its dthread object. As described in Deliverable D5.3, the write buffer and the speculative thread creation are required to ensure execution without side effects, and therefore enable Double Execution and Thread Restart Recovery. Our Double Execution implementation and Thread Restart Recovery mechanism fully support the T* instruction set as described in Deliverable D6.2. Double Execution and speculative thread support can be activated in the COTSon configuration file by setting the following options:

    options
      tsu_speculative_threads = true    -- activate speculative thread creation
      double_execution = true           -- activate Double Execution

2.11 High-Level Fault Injection Technique (COTSon Modules) (UAU)
To investigate the thread execution performance of the Double Execution and Thread Restart Recovery mechanisms in the presence of failures, we extended the baseline TSU implementation with a failure injection mechanism. Currently, the failure injection mechanism assumes a constant failure rate per core. However, we will incorporat
e Lua language, further configuration of some scripts is recommended to allow more control over the simulated events: for example, to set any specific option (e.g., MPI), to specify features such as the definition of a region of interest, or even to store the output of the simulation in a file on the host machine. In addition, this work should be done for each parameter of the benchmark used. PIKE can be run in two different modes: silent, in which the main simulation steps are shown, and verbose, a debug mode in which every single operation performed by PIKE is traced. Fig 14 shows an example of the information provided by PIKE when it is executed in silent mode, and Fig 15 depicts the execution of PIKE in verbose mode.

[Fig. 14: Executing PIKE in silent mode - console transcript showing the main steps: reading the configuration file (pike.conf), loading the SimNow configuration and fetching and uncompressing its sources, then loading the COTSon configuration and fetching its sources.]

The purpose of PIKE is to automate the simulation configuration and execution, generating all the Lua files and scripts suitable for benchmark execution. In addition, it allows the user to exploit all the available host cores, and enables simulation in batch mode by means of a thread pool mechanism creat
e address of the first element of matrix A
    uint64_t r3;                    // r3 contains the address of the first element of matrix B
    uint64_t r28 = A[r32];
    uint64_t r29 = B[r30];
    r26 += r28 * r29;               // current part of the sum
    r30 += r5;
    r32++;
    r34++;
    // if the current element is the last one for this sub-block,
    // schedule move_to_next_el, else schedule calc_curr_el
    uint8_t cnd = (r34 < r5);
    uint64_t r44 = DF_TSCHEDULE(move_to_next_el, 12);
    r44 = DF_TSCHEDULE(cnd, calc_curr_el, 12);
    DF_TWRITE(r2,  r44, 1);         // A
    DF_TWRITE(r3,  r44, 2);         // B
    DF_TWRITE(r4,  r44, 3);         // C
    DF_TWRITE(r32, r44, 4);         // index for A
    DF_TWRITE(r30, r44, 5);         // index for B
    DF_TWRITE(r5,  r44, 6);         // size
    DF_TWRITE(r26, r44, 7);         // current part of the sum
    DF_TWRITE(r34, r44, 8);
    DF_TWRITE(r10, r44, 9);         // start index
    DF_TWRITE(r58, r44, 10);        // pointer to the FM of thread join_threads
    DF_TWRITE(r7,  r44, 11);        // the end index for this part
    DF_TWRITE(r55, r44, 12);        // log size
    DF_TDESTROY();
}

void move_to_next_el(void)
{
    DF_TLOAD(12);
    uint64_t r2 = DF_TREAD(1);      // A
    uint64_t r3 = DF_TREAD(2);      // B
    uint64_t r4 = DF_TREAD(3);
e more complex failure distributions in the last year of the project. From the constant failure rate λ [Koren07], we derive the reliability of a core at time Δt as

    R(Δt) = e^(-λ·Δt)

where Δt is the duration since the last failure occurred in this core. A core's reliability value is updated when the core executes a thread and issues a TDESTROY or TWRITE instruction. The TSU subsequently generates a random number with 0 < rand < 1, and verifies whether the core has suffered from a defect:

    bool faulty = (random > reliability);

We distinguish between two failure injection modes:
• bit flip failures, and
• thread failures.
For the bit flip injection, the TSU determines the reliability on each TWRITE and checks whether the core became faulty. If the core has suffered from a fault, one random bit within the TWRITE parameters is flipped. For the thread failure injection, the TSU checks the reliability during each TDESTROY operation. If the core has suffered from a defect during thread execution, the TSU tags the thread as defective and starts recovery actions. The failure injection functionalities are managed i
e pragmas that are available to the user. A user can insert a specific pragma by just clicking on an item of the Sample View list. The Property list, as the name suggests, contains the properties of each pragma along with the available parameter values.

[Fig. 21: The side panel plug-in showing a drop-down list for the options of the scheduling mode. The Sample View list offers the items Kernel, Start Program, End Program, Private Variable, Block, End Block and Thread; for the Thread pragma the Property list shows thread number, sched_mode (with values DVM_DYNAMIC, DVM_STATIC, DVM_RROBIN, DVM_MODULAR, DVM_CUSTOM), readycount, addr/size/flag and import/export var. The scheduling mode determines how data will be scheduled to the processors.]

An example is shown in Fig 21, where the Thread pragma is selected in the Sample View list and the Property list shows its properties, such as the thread number, the scheduling mode and value, the ready count value, etc. Automatically closing pragmas if possible ddm startprogram
e were addressed by AMD and fixed in an NDA version of SimNow. This means that the TM module works correctly with the NDA version, but not with the current public release. The functional TM support has been made available to the TERAFLUX partners and the wider community through a branch in the COTSon SourceForge repository. Further, a subset of the STAMP benchmarks (kmeans, vacation) is included for testing and demonstration.

2.12.2 Adding timing support with COTSon
Timing models for Transactional Memory have been added in a separate branch of COTSon. This branch includes models for two TM systems. The relevant contributions are:
• the first is a simple bus-based broadcast implementation;
• the second is a more scalable distributed system; this involved adding timing support for a distributed directory-based cache coherence protocol;
• a network model has been implemented in place of the standard bus simulation present in COTSon, leading to a more realistic model for large scalable systems.
To get this version, the interested partner has to sign a Non-Disclosure Agreement with AMD. This has not yet been possible for all partners.
ed according to the characteristics of the host machine.

3.2.1 Overall organization
PIKE uses a single configuration file to set the parameters of the simulation. Such a file is used to set: (1) the list of simulations to run; (2) the software configuration, like the communication type, the input file name and the region of interest; (3) the hardware properties, like the cache configuration, the timing model, the node number and the core number.

[Fig. 15: Executing PIKE in verbose mode - console transcript of a run in the pike_env/test directory.]

Through this single configuration file, PIKE produces the simulation output and statistics inside a specified folder, which we refer to as the WorkingDirectory. The PIKE configuration requires the user to specify the path to the directories listed in Table 2.

Table 2: Directories required by the PIKE configuration
- bin: contains all the benchmark binaries, usually compiled on the host machine.
- config: contains the configuration files for the simulation; currently this directory must contain the ROM file eventually specified in the configuration (PIKE will install it automatically).
- image: directory that contains the optional ISO images for SimNow and any files ...
f a set of 15 thoroughly commented examples illustrating all the features of the language. The applications ported to OpenStream in WP2 have been distributed together with the OpenStream source code. They have also been packaged as stand-alone benchmarks, with multiple data sets and auto-tuning scripts, to facilitate the adaptation of the grain of parallelism to the target. The current list of distributed OpenStream programs is: cholesky, fmradio, seidel, fft-1d, jacobi, strassen, fibo, knapsack, matmul, bzip2 (SPEC CPU 2000) and ferret (PARSEC). For some of these programs, multiple versions are provided to compare data-flow style, Cilk join style and barrier style implementations. OpenStream applications are supported by the software run-time implementation of T*. In addition, most applications run on COTSon when compiled using the hardware ISA branch of the TERAFLUX compiler (i.e., the SourceForge public repository [SF]). The only problematic ones are the Cilk join and barrier style variants of the benchmarks that make use of the lastprivate or taskwait constructs of OpenStream. These currently cannot be implemented using T*: the compiler makes use of scheduling and stack manipulation mechanisms not supported by the tsu2 branch of COTSon. This is not a major issue, as the data-flow style programs compile and run properly, but for completeness, and to facilitate the implementation of larger applications, we are working on an extension of the T* ISA t
gi96, Giorgi97], especially in the many-core era. With the aim of providing a tool characterized by a high simulation speed and accuracy for a heterogeneous kilo-core architecture, integrating an accurate network-on-chip simulator, the TERAFLUX project adopts a framework based on the COTSon [COTSon09] infrastructure. Compared with current state-of-the-art simulation platforms, this approach offers a complete environment for many-core full-system simulation and for its power consumption estimation. In order to guarantee fast simulations, COTSon implements a functional-directed approach, where functional emulation is alternated with complete timing-based simulation. The result is the ability to support the full stack of applications, middleware and OSes. The modular approach on which COTSon is based allows us to adopt the proprietary AMD SimNow [SimNow09] as emulator. Finally, the integration of the proposed framework with the McPAT tool [MCPAT09] provides the ability to estimate power consumption.

[Fig. 11: Exterior vision of the HP ProLiant DL585, the main TERAFLUX simulation server.]

3.1 The tfx3 TERAFLUX Simulation Host
The host machine that we selected as the Consortium-wide simulation host (its cost was initially planned in our Annex 1) is shown in Fig 11. This is the computer where we run the simulated virtual processor, with the guest machine as the simulated machine. We verified that such a platform is able to support the simulation of a
he mul_thread_end thread is responsible for writing the fake value to the frame memory of the join_threads. Please note that:
• the main_ep_1 must be separated from the main thread, since the latter schedules one instance of the join_threads, but main_ep_1 is scheduled np times, since it must schedule np mul_thread instances (one for each sub-block);
• the mul_thread and mul_thread_next_el can't be merged: the former calculates the bound indexes for the current sub-block, while the latter is scheduled for each element of the sub-block (i.e., s²/np times);
• the calc_curr_el can't be merged with the mul_thread_next_el: the first performs the multiply-and-sum operation needed for computing the current element, and thus is scheduled for each term of the sum (i.e., for each pair of elements read from A and B), while the second runs only once for each element;
• the move_to_next_el must be separated from the calc_curr_el thread, because it must check for the completeness of the current sub-block once the current element of the sub-block has been successfully calculated;
• the mul_thread_end can't be merged with the move_to_next_el, because it is scheduled once for each sub-block (it is responsible for the fake write into the join_threads frame), while the other is scheduled once for each element of the sub-block.
In the following we also list the mmul.c code for completeness:

define define define define define define define stat TSU PRELOAD
implementation process. Partners contacted HP members directly, or even via the COTSon forum, and received quick answers to their requests, suggestions, doubts, etc. This has been a very relevant contribution to all partners, and it should be appreciable throughout this document.

3.5 Tutorial Sessions on OmpSs Open to the Partners (BSC)
The StarSs programming model is the proposal from BSC in TERAFLUX to provide a scalable programming environment to exploit the dataflow model on large multicore systems-on-a-chip, and even across accelerators. StarSs can be seen as an extension of the OpenMP model. Unlike OpenMP, however, task dependencies are determined at runtime, thanks to the directionality of data arguments. The StarSs runtime supports asynchronous execution of tasks on symmetric and on heterogeneous systems, guided by the data dependencies and choosing the critical path to promote good resource utilization. The StarSs (also named OmpSs) tutorials have also covered the constellation of development and performance tools available for the programming model, the methodology to determine tasks, the debugging toolset and the Paraver performance analysis tools. Experiences in the parallelization of real applications using StarSs have also been presented. Among them, the set of TERAFLUX selected applications in WP2 have been ported to StarSs and made available to the partners. Such training and tutorials have been given at TERAFLUX meetings and related summe
ing 16 x86-64 cores based on the AMD Opteron (L1_JH_F0, 800 MHz) architecture and 256 MB of DRAM per core. Fig 12 depicts the host and guest systems.

[Fig. 12: Host versus Virtual System - the 64 Opteron cores of the host on one side; on the other, the simulated nodes, each comprising virtualized x86-64 cores, a virtualized memory controller and a virtualized network interface.]

Each node runs a Linux operating system. On top of this system we are able to run several benchmarks based on both the OpenMP and MPI programming models. One of the main modifications we did is the implementation of the DF-Thread support [Portero11, Giorgi12, Giorgi07] through the ISA extension. DF-Threads enable a different execution model, based on the availability of data, and allow many architectural optimizations not possible in current standard off-the-shelf cores.

[Fig. 13: Number of x86-64 virtual cores (nodes) vs. memory utilization in the HP ProLiant DL585 G7 server (1 TB memory, 64 x86-64 cores) - breakdown of host memory utilization into free and used memory.]

We can still double the number of virtual nodes from 64 to 128 (one master node and 128 slaves), resulting in a 40% usage of the DRAM memory in the
ion faster. This configuration was compiled on the tfx2 machine (see above).

2.9 GCC Backend and OpenStream Experiments on COTSon (INRIA)
The TERAFLUX backend compiler has been maturing over the course of the third year of the project. It compiles OpenStream programs (data-flow streaming extensions of OpenMP) to T* intrinsic functions, themselves compiled to the T* ISA. The code generation pass has been developed as a middle-end pass in GCC 4.7.0, operating on three-address GIMPLE-SSA code. The traditional compilation flow is being modified according to a specialized adaptation of the built-in-based late expansion approach described in D4.2 (first year deliverable); see also [Li12]. Built-ins are used both to convey the semantics of the input and output clauses of streaming pragmas to the compiler middle-end, and to capture the semantics of efficiency languages such as HMPP, StarSs/OmpSs and TFLUX. More details can be found in [Pop13] and Deliverable D4.1. As part of the training and internal dissemination activities, a step-by-step OpenStream tutorial has been designed and distributed with the OpenStream repository. It consists o
iply benchmark, we start to see some loss of scalability after 512 cores; this is due to the lack of parallelism, as we chose too small a data set for this experiment. As a side note, we can see that the simulator is also able to catch such behaviors.

2.6 Power estimation using McPAT (UNISI)
Power estimation, along with temperature and reliability, is an important metric that enables the envisioned architecture to schedule DF-Threads with the aim of improving the overall resiliency of the system. This has been extensively discussed in the previous Deliverable D7.3. Here we briefly describe how this mechanism has been extended from an off-line to an on-line methodology. This is necessary to drive the scheduling actions during the program execution. Looking at the simulation level, power estimation is obtained with the use of an external tool called McPAT [MCPAT09]. McPAT has been developed by HP with the ability of estimating the power consumption, timing and area of a given microarchitecture. Specifically, McPAT implements an internal model to compute the power consumption based on the activity within the modeled microarchitecture. The activity refers to the instructions executed by the modeled systems and, in
2.2 T* Instruction and Built-In Support in the C Language (UNISI, HP)
In the TERAFLUX project, the T* Instruction Set Extension (ISE) to the x86-64 ISA has been introduced for managing threads in a dataflow style, by means of dedicated hardware units that execute the custom instructions. In order to experiment with these new T* instructions, we used a simulation mechanism which overloads a set of unused existing x86 instructions, thus allowing us to rely on a very well tested virtualizer like SimNow (part of COTSon). In order to simulate this feature in COTSon, and to have more flexibility in the register mapping of the compiler, we overload the semantics of a particular x86-64 instruction called prefetchnta. This has the advantage of being a hint with no architecturally visible side effects, and it does not clobber any architectural register. From the x86-64 instruction manual:

    prefetchnta m8

where m8 is a byte memory address respecting the x86-64 indexed base+offset format [x86, Chapter 2]. This instruction is harmless to the core execution, since it is just a cache hint; that is why we selected it as the mechanism to convey additional information into the simulator. It is also rich enough to support a large encoding space, as well as immediates and registers for T* instructions, as we describe in more detai
ks have been implemented for running in the COTSon simulator. The Matrix Multiplier benchmark is already available in the COTSon repository, while the Radix Sort benchmark is going to be released in the near future.

2.3.1 Matrix Multiplier
The matrix multiplication algorithm chosen for the T* C-like implementation is the blocked matrix multiplication version, in which the result matrix C = A·B is recursively constructed as

    C_ij = Σ_k A_ik · B_kj        (1)

where C_ij represents a sub-block of the result matrix. The input matrices are required to be square, for simplicity. The input parameters that the algorithm needs for execution are two integers, s and np, both required to be powers of 2:
• s: number of rows and columns of the square matrices A, B and C;
• np: total number of partitions (blocks).
For example, running the application with s=32 and np=4 will perform a multiplication of 2x2 blocked matrices, in which each block is composed of 16x16 elements. Details on the structure of the dataflow version of this benchmark are reported in Appendix A. The source code of the matrix multiplier algorithm is available to the TERAFLUX partners on the public SourceForge website [SF]. We report the code in Appendix A for quick reference.

2.3.2 Other Benchmarks
A dataflow version of the recursive Fibonacci application has been implemented in C, using the built-ins introduced in Section 2.2, similarly
lar, we highlight the following relations:
• D7.1 (WP7) for the first architectural definition;
• D2.1, D2.2 (WP2) for the definition of the TERAFLUX relevant set of applications;
• D4.1, D4.3 (WP4) for the compilation tools towards T*;
• D5.1, D5.2, D5.3 for FDU details;
• D6.1, D6.2, D6.3 (WP6) for the architectural choices taken during the first 3 years of the project;
• D7.1, D7.2, D7.3 (WP7) for the previous research under this WP.

1.2 Activities Referred to by this Deliverable
This deliverable reports on the research carried out in the context of Task 7.1 (m1-m48) and Task 7.3 (m6-m40). In particular, Task 7.1 covers an ongoing activity for the entire duration of the project that ensures the tools are appropriately disseminated and supported within the consortium (see Annex 1, page 52), while Task 7.3 is related to the implementation in the common evaluation platform of the fault injection and power models (see Annex 1, page 53).

1.3 Summary of Previous Work from D7.1, D7.2 and D7.3
During the first two years, the TERAFLUX partners started using COTSon and modified it in order to implement, test and validate new features to
ls below. The additional information includes the T* opcodes and their parameters, as introduced in D6.1 and D6.2, as well as other new T* instructions besides the 6 original ones introduced in D7.1 and D6.2, whose need became clearer as we started experimenting with more complex code. Moreover, this instruction is a good match for the compilation tools, because it does not alter any content of the general-purpose registers. For example, other user-defined functionalities of COTSon and the initial T* implementation use CPUID (see D7.1, D7.2), which has the unpleasant side effect of modifying RAX, RBX, RCX and RDX, causing compiler complexity and unnecessary spill/restore overhead. In order to minimize the probability of overloading an instruction used in regular code, we selected as MOD R/M byte [x86] the value 0x84, which means that m8 specifies a 32-bit memory address that is calculated as base + index·2^scale + displacement32. The base and index register identifiers and the scale (2 bits) are packed in a so-called SIB byte [x86]; displacement32 is another 4 bytes. In such a case, we have a total of 5 bytes, after the opcode and the MOD R/M byte, that are available for the encoding of the T* ISE. We then defined a magic value (0x2daf) as a reserved prefix, which indicates a prefetch 0x2daf0000 = 766,443,520 bytes past a scaled index and base address: not something that has any conceivable use in practice. As a matter of fact, we
m part of the thread graphs. For each configuration, except for the startup and ending phases, we observe that there is always a number of running DF-Threads equal to the number of cores, demonstrating that the execution paradigm is always able to load the system.

[Figure: thread-count graphs for the 4-, 8-, 16- and 32-core configurations - each panel plots the number of waiting (TH WAITING), ready (TH READY) and running (TH RUNNING) threads against the clock cycle.]
m the nature of such materials, for example references to publicly available forms or documents. Disclosure or use of this document, or any material contained herein, other than as expressly permitted, is prohibited without the prior written consent of TERAFLUX or such other party that may grant permission to use its proprietary material. The trademarks, logos and service marks displayed in this document are the registered and unregistered trademarks of TERAFLUX, its members and its licensors. The copyright and trademarks owned by TERAFLUX, whether registered or unregistered, may not be used in connection with any product or service that is not owned, approved or distributed by TERAFLUX, and may not be used in any manner that is likely to cause customer confusion or that disparages TERAFLUX. Nothing contained in this document should be construed as granting, by implication, estoppel or otherwise, any license or right to use any copyright without the express written consent of TERAFLUX, its licensors or a third-party owner of any such trademark. Printed in Siena, Italy, Europe. Part number: please refer to the File name in the document footer.

DISCLAIMER: EXCEPT AS OTHERWISE EXPRESSLY PROVIDED, THE TERAFLUX SPECIFICATION IS PROVIDED BY TERAFLUX TO MEMBERS "AS IS" WITHOUT WARRANTY OF ANY KIND, EXPRESS, IMPLIED OR STATUTORY, INCLUDING BUT NOT LIMITED TO ANY IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT OF THIR
meet their research needs. In particular, we are able to boot a 1000-core machine based on the baseline architectural template described in D7.1. The target architecture can exploit all the features added by the various partners to the common platform; this is very important for the integration of the research efforts carried out in the various TERAFLUX WPs. In particular, an initial FDU interface with the TSU (both DTS style and DDM style) has been described in D7.2 and further detailed in D7.3. Similarly, in D7.3 a first model developed to monitor power consumption and temperature was reported.

2 New Simulation Features
2.1 Brief Overview of the TERAFLUX Evaluation Platform (ALL WP7 PARTNERS)
The TERAFLUX project relies on a common evaluation platform that is used by the partners with two purposes: (i) to evaluate and share their research by using such an integrated common platform, and (ii) to transfer to the other partners the reciprocal knowledge of such platform. Fig 1 shows the high-level vision of the evaluation platform.

[Figure: TERAFLUX EVALUATION PLATFORM, with the simulated cores.] Fig 1 TERAF
…model not fully connected with other component timing models. The support for power estimation is now integrated in the evaluation platform. The Fault Detection Unit (FDU) subsystem is also implemented in COTSon, providing support for double execution of threads and thread restart recovery in the single-node case. Moreover, in order to test the correctness and effectiveness of the fault detection mechanisms, the single-node DTS implementation has been extended with a high-level fault injection technique, which is also described in this deliverable. Other dataflow variants, like the Data-Driven Multithreading (DDM) from the UCY partner, have also been tested in COTSon, both in the single-node and multi-node configurations. All the newly implemented characteristics have been successfully integrated in the common platform, also thanks to the support provided by the HP partner, which released COTSon at the very beginning of this project to all the TERAFLUX partners. A new tool called PIKE, supporting large target-machine simulations, has been realized and released in the public repository to the TERAFLUX partners and, more in general, to the scientific community. This tool acts as a wrapper of the COTSon simulator and simplifies the configuration process needed for running a set of simulations, thus speeding up the evaluation of newly implemented research solutions. The originally planned
…n (see D6.2, Table 1) from the frame memory of a DF-thread, at slot number 5. The compiler should then target such a T* built-in. For testing, we also provide a set of C-language built-ins that can be embedded in manual C code; the read would be expressed as DF_TREAD(5), as shown here (a more extensive example is provided in Appendix A for quick reference):

    uint64_t a;
    a = DF_TREAD(5);

This will then be assembled as prefetchnta 0x2daf050e…, which will have the meaning TREAD 5, rdx. In fact, the corresponding bytes representing the instruction will be 0F 18 84 12 0E 05 AF 2D. The container of the custom instruction is therefore 0x0F1884…AF2D, which is already described above and is the same for all the custom instructions. The useful bits (underlined) are:
- 0x12: specifies the identifier of the destination register of TREADQI, which is connected to the destination variable a by the gcc macro expansion;
- 0x0E: the T* opcode for TREADQI (TREAD with immediate value); other currently experimented opcodes are reported below;
- 0x05: the immediate value of the DF_TREAD.
…n the COTSon configuration file, by activating the core_failure_injection = true parameter, while the failure rate per second can be adjusted by the core_failure_rate parameter. Finally, the failure injection mode is selected by the failure_injection_mode parameter. Fig. 10 exemplarily shows the performance degradation induced by thread failure injection and thread restart recovery for Fibonacci(40). More results using the failure injection mechanism can be found in Deliverable D5.3. [Fig. 10: Performance degradation of Fibonacci(40) using thread failure injection, with failure rates per core of 10/s and 100/s, over 4, 8, 16 and 32 cores.]

2.12 Transactional Memory Support in COTSon (UNIMAN)

This section describes the implementation of Transactional Memory (TM) on SimNow and COTSon. Although COTSon provides the timing model for our simulations, we cannot control the flow of the program, as required in a TM implementation, from within COTSon. Correc
…o support these constructs directly. At this time, the TERAFLUX memory model is work in progress in COTSon. A preliminary formal specification exists for Owner Writable Memory (OWM, see D7.1) regions [Gin12], and UNIMAN implemented TM in COTSon, but the former is not yet implemented and the latter has not been merged with the tsu2 and tsu3 branches of the simulator. As a result, only pure dataflow benchmarks are currently able to scale to 1024 cores, or to run on multiple nodes in general. Unfortunately, the current compilation flow for OpenStream makes use of intermediate proxy data structures for run-time dependence testing. This is necessary to implement the sliding-window semantics of the language's streams and to support the rich region-based dependences of StarSs/OmpSs. Because of this, OpenStream programs currently run on a single node only, and will stay that way until the memory model is implemented in the simulator. To cope with this limitation, and also to enable additional performance comparisons, a low-level intrinsic (builtin) interface to T* has been implemented in the TERAFLUX back-end compiler. This interface retains a C syntax and semantics, abstracting the low-level optimization details of the compilation flow, but it requires the programmer to think directly in terms of dataflow threads, carrying the frame meta-data explicitly. Still, it allows pure dataflow programs to be written and to scale on the full architecture. Technical informati
…of the execution time for both the Fibonacci computation of 40 and Matrix Multiply with matrix size 512. We simulated a number of cores from 1 to 1024, in steps of powers of 2: in the configurations up to 32 cores the systems are single-node; from 64 to 1024 cores each simulation runs on systems with multiple nodes, each node hosting 32 cores. [Fig. 6: Multi-node simulation; speedup versus number of cores (1 to 1024, log scale) for Fibonacci with input set to 40 (FIB 40) and Matrix Multiply with matrix size 512x512 (MMUL 512), partitioned in a number of blocks equal to the number of cores.] As we can see, we have reached the ability to simulate the dataflow execution model not only in the single node but also across nodes, without changing the programming model or execution model when passing from the single-node case to the multi-node case. Of course, we need to tune up the system in order to evaluate the sensitivity to the availability of resources like bandwidth and memory controllers, as explored initially in the deliverables D2.1 and D2.2 regarding the Application work package. In the case of the matrix mult
[The remainder of this page is a garbled extraction of the Appendix A listing. The recoverable structure is the following: the main_ep1 thread reads its parameters from its own frame with DF_TREAD (slot 4: size; slot 5: np; slots 6-7: size*size; slot 8: log size; slot 9: pointer to the FM of thread join_threads; slot 10: current partition index), schedules a mul_thread instance with uint64_t r60 = DF_TSCHEDULE(1, mul_thread, 8), fills its frame with eight DF_TWRITEs (A, B, size*size, the part assigned to it (size/num_processors), log size, the pointer to the FM of thread join_threads, …), conditionally reschedules itself with DF_TSCHEDULE_cnd(main_ep1, 10), and terminates with DF_TDESTROY. The subsequent mul_thread and mul_thread_next_el bodies follow the same DF_TLOAD / DF_TREAD / DF_TSCHEDULE / DF_TWRITE / DF_TDESTROY pattern.]
…on, source code, tutorial examples and benchmarks are available online and updated regularly: http://www.di.ens.fr/OpenStream.html.en

2.10 Double Execution and Thread Restart Recovery in a Single Node: COTSon Modules (UAU, HP)

In this section we give an overview of the implementation details of the FDU subsystem, Double Execution and Thread Restart Recovery in the TERAFLUX simulator. For more details about the Double Execution and Thread Restart Recovery mechanisms, please refer to Deliverable D5.3, which describes the technical implications of Double Execution and thread restart recovery for the TERAFLUX architecture. The enhanced source code is publicly available in the tflux-test branch of the public COTSon SourceForge repository; the simulator extensions in this section are based on the functional tsu2 implementation provided by the partners HP and UNISI. Please note that, at this point of the simulator integration, there is neither a functional differentiation between D-FDU and L-FDU nor between D-TSU and L-TSU. Hence we refer just to FDU and TSU, respectively.

2.10.1 FDU Subsystem in COTSon

The FDU subsystem
…only valid proposals appear. Proposals are made according to what the user has written so far. Many of the parameters in a pragma directive have predefined values, like the scheduling policies shown in the above image. [Fig. 20 (screenshot): the side panel plug-in imported to the Eclipse platform, showing a sample main.c with #pragma ddm startprogram / endprogram directives and a Property list with entries such as thread number, sched_mode = DVM_DYNAMIC and sched_value; a tooltip explains that the scheduling mode determines how data will be scheduled to the processors.]

3.3.2.2 The Side Panel Plug-in

Fig. 20 depicts the side panel plug-in imported to the Eclipse platform. This plug-in consists of two lists: the Sample View list and the Property list. The Sample View contains th
…r schools, workshops and conferences, like the CASTNESS workshops, the PUMPS Summer Schools 2011 and 2012, the HiPEAC 2012 conference and the Supercomputing 2012 conference. The second training activity from BSC, to train other partners in the use of the target simulation environment, has been on the occasion of the mechanism devoted to sharing memory among COTSon nodes. It is based on characterized release consistency as an underlying foundation for the TERAFLUX memory model. The three proposed operations have been Acquire Region, Upgrade Permissions and Release Region, which have enabled the exploration of inter-node shared-memory techniques by replicating application memory in all nodes and mapping all guest memory onto a single host buffer. We have implemented a release-consistency backend for COTSon, where the application can request acquires, upgrades and releases on memory regions. Our lazy memory replication aggregates multiple updates, and a functional backend copies memory among nodes. Discussions among partners have enhanced the implemented backend, and benchmark tests have shown its usability.

References

[Cameron95] S. Cameron Woo, M. Ohara,
…ructure in the TSU by a per-thread write buffer (wbuf). This write buffer is created along with a dthread object when the TSU has received a TSCHEDULE operation. After the thread becomes ready to execute, all subsequent TWRITEs of this thread will be redirected to the wbuf data structure. The TWRITEs of the leading thread are held in the write buffer until both the leading thread and the trailing thread have executed their TDESTROY instructions. Additionally, the CRC-32 signature, incorporating the target thread ID, the target address and the data of each single TWRITE, is calculated and stored in the FDU for both the trailing and the leading thread. When the TSU receives a TDESTROY instruction, it checks whether the redundant thread has finished its execution by verifying the thread's current state. If the redundant thread is still running, the TSU marks the finished thread as ready-to-check. Otherwise, the TSU calls the FDU singleton to indicate that both threads have finished their execution. The FDU in turn compares the stored CRC-32 signatures and returns a Boolean result, indicating a fault-free execution (true) or a faulty execution (false)
[Console residue from Fig. 9: DMA/PIO mode lines; four 16384-byte worker stacks and worker frames; "Program: Matrix Multiplication, Cores: 4"; serial time 0.000188 s, parallel time 0.000237 s; "All workers done, goodbye".] [Fig. 9: Blocked Matrix Multiply running on a four-CPU machine.]

2.8 Integrating the DDM TSU into COTSon (UCY)

As a continuation of the work described in the previous section, we have integrated the DDM TSU into COTSon, using as a template the tsu2 code provided in the TERAFLUX public repository (https://cotson.svn.sourceforge.net/svnroot/cotson/branches/tflux-test/tsu2) and the TSU version of the DDM system. The tsu2 operates as an intermediate API to provide communication between the user application and the TSU unit. To validate this implementation of the TSU, we have executed the blocked matrix multiply benchmark with 4 workers on a single machine (see Fig. 9). We have used a single queue to store threads that are ready for execution, and a FIFO policy for scheduling. The TSU does not operate in busy-wait mode; instead, its execution is event-driven, which seems to make simulat
…tep has been to code those examples by hand, in order to allow WP4 to have some simple examples targeting the proposed T* instructions. On the simulator side, the effort this year has been to properly support the execution of dataflow threads; this is coded in the publicly available modules TSU, TSU2 and TSU3 on the SourceForge website. DF-threads can be either waiting to become ready (i.e., their synchronization count has not reached zero) or already in the ready queue, waiting for execution once some core becomes available. In the single-node experiments we varied the number of cores from 1 to 32. In this context, simulations have been successfully performed. [Fig. 2 residue: number of threads (thousands) in the states TH_WAITING, TH_READY, TH_RUNNING and TOTAL versus clock cycles (millions), for the 4-, 8-, 16- and 32-core single-node configurations.]
…the real multiplication algorithm for the sub-block, and then to re-schedule itself under the condition that the multiplication algorithm has not yet been started for all the np partitions (i.e., this is not the np-th execution of the main_ep1 thread). The mul_thread is responsible for calculating the bound indexes for the sub-block; it then unconditionally schedules the mul_thread_next_el thread, which computes the indexes for reading from the input matrices A and B and passes them to the calc_curr_el thread for calculation. The calc_curr_el thread reads the current element values from matrices A and B, then calculates the value of the current element of C; if all the terms of the sum have been calculated, the move_to_next_el thread is scheduled, otherwise it schedules itself again for reading the next elements from the input matrices. The move_to_next_el thread is responsible for checking the completeness of the current sub-block calculus: if the sub-block is ready, then mul_thread_end is scheduled, otherwise the mul_thread_next_el thread is scheduled again for calculating the next element of the current sub-block. T
…tid is guaranteed to have bits 0-31 at 0 (see TWRITE). Constraint: %tid and %1 must specify the same register identifier, i.e. the same x86_64 register. For variable sc, or sc > 255, the general version TSCHEDULEP is required.
- tdestroy dfr: called at the end of a dataflow thread, to signal to the TSU the end of a thread execution and to free up thread resources. To reduce simulation polling overhead, the thread is destroyed internally and the address of the next thread, if any is available, is returned in a register (this slightly deviates from the previously defined syntax, which was just TDESTROY). It is a peeling optimization dealing with the common case when the queue of ready threads is not empty, so that there is no need to return to the polling loop.
- treadqi im: reads the 64b value stored at the im immediate offset of the frame of the self thread. This is the immediate form, with im < 256. For im > 255, or variables, use the general form TREAD. The offset immediate is expressed in 64b words, i.e. offset 2 is byte 16.
Note: this implementation is slightly different from what is described in D6.2, where we proposed to write tid only in case of a true condition, that is tschedule &%tid, %ip, %sc.
[Fig. 23 (screenshot): the side panel plug-in showing the properties of a selected pragma, with a sample main/fact source open in the editor.] A user is able to change the properties of a specific pragma by moving the cursor onto the line of that pragma. This causes the Property list of the side panel plug-in to show the properties of the selected pragma, as shown in Fig. 23.

3.4 Support to the Partners for Implementing COTSon Extensions (HP)

The COTSon simulator is released by HP to the scientific community. In the context of the TERAFLUX activities, the COTSon simulator has been extended in order to provide the partners with all the features needed for their research. In particular, the simulation platform is shared among all the members of the TERAFLUX consortium, so that each partner can add features or extend existing ones. In this process it is important to have strong support from the simulator releaser, in order to speed up the development phase. To this end, and even before the project started, HP provided strong support to the other TERAFLUX partners in the
…tly dealing with transactional execution and transaction aborts requires a TM module for SimNow, which exists outside COTSon. There are two interfaces provided by AMD in order to interact with SimNow: the Analyzer interface and the Monitor interface. The Monitor interface is much faster than the Analyzer interface, but allows less interaction with the execution. If we intercept memory accesses, as required during transactions, the Analyzer interface runs 40-50X slower than the Monitor interface. This performance advantage is why the Monitor interface was chosen for COTSon. For our TM implementation, two important features are needed: intercepting memory accesses to detect conflicts, and saving and then restoring the register state of the processors to correctly deal with aborts. Further, to arrive at realistic performance estimates, the existing performance models in COTSon need to be extended with transactional behavior. For this reason we will be using both interfaces together when using transactions.

2.12.1 Functional Transaction Support

Functional support in a SimNow analyzer module keeps track of read and write sets, detects conflicts, and performs the necessary cleanup in case of aborts. At this level our system can model both eager and lazy versioning, and eager and lazy conflict detection. The behavior of the TM is described in more detail in Deliverable D6.3. During the implementation and testing of this module, several bugs were identified in SimNow. Thes
…to the Matrix Multiplier described in the previous section. The well-known Radix Sort benchmark, which is one of the kernel applications included in the SPLASH-2 suite [Cameron95], has also been developed in the T* C-like style for our experiments. The implementation of this algorithm is still ongoing, because it requires some protection mechanism for managing concurrent accesses to shared data. Since in the TERAFLUX project Transactional Memory (TM) is supposed to be adopted for this purpose, the implementation of this benchmark will be completed in the near future by exploiting the new TM feature added by the UNIMAN partner to the COTSon platform.

2.4 Single-Node T* Tests (UNISI)

In order to show the potential of the implementation of T*, we show here the possibility to collect some statistics (number of dataflow threads that are running, ready and waiting) related to the execution of some benchmarks on the modified COTSon platform. For this sake we selected the Matrix Multiplier described in Section 2.2.1 and the Recursive Fibonacci, already introduced and described in the previous deliverables D6.1 and D6.2, as pipe-cleaners. The first s
…ue in register tval to the location at the im immediate offset of the frame of thread tid. This is the immediate form, with im < 256. The offset im is expressed in bytes and has to be 64b-aligned. For im > 255, or a variable, use the general form TWRITE. This is just a different way to write the TWRITEQI.
- tload res: loads the TSU frame values into a locally allocated memory chunk of size res that is directly accessible by the thread with standard loads and stores. Depending on the implementation of the TSU, it could be simply a no-op.
- tstore tloc, ptr, len: writes the values in memory, starting from address ptr and with length len, to the frame location tloc. The tloc register packs a thread handle tid and an offset off, so that tloc = tid + off. The value of tid is the return value of the TSCHEDULE instruction and its variants, and is guaranteed to have the 32 least significant bits set to 0, so that a thread location can be constructed with standard address arithmetic; for example, tid could be the address of the frame.
- tstoreqi tloc, ptr, len: immediate version of the TSTORE operation, with len a 1-256 immediate.
Other instructions are used in the runtime:
- … dfr: called within a worker thread, polls the TSU about work to do; the address of the dataflow thread to start is returned in the register dfr. Used in the runtime and not in the dataflow program.
- tinit nopr, pstack: initializes a dataflow worker
…uler
Emulator: tool capable of reproducing the functional behavior (synonymous, in this context, of Instruction Set Simulator, ISS)
D-FDU: Distributed Fault Detection Unit
ISA: Instruction Set Architecture
ISE: Instruction Set Extension
L-Thread: Legacy Thread, a thread consisting of legacy code
L-FDU: Local Fault Detection Unit
L-TSU: Local Thread Scheduling Unit
MMS: Memory Model Support
NoC: Network on Chip
Non-DF-Thread: an L-Thread or an S-Thread
NODE: group of cores (synonymous of CLUSTER)
OWM: Owner Writeable Memory
OS: Operating System
Per-Node Manager: a hardware unit including the DTS and the FDU
PK: Pico-Kernel
Sharable Memory: memory that respects the FM/OWM/TM semantics of the TERAFLUX Memory Model
S-Thread: System Thread, a thread dealing with OS services or I/O
StarSs: a programming model introduced by the Barcelona Supercomputing Center
Service Core: a core typically used for running the OS or services, or dedicated to I/O or legacy code
Simulator: emulator that includes timing information (synonymous, in this context, of Timing Simulator)
TAAL: TERAFLUX Architecture Abstraction Layer
TBM: TERAFLUX Baseline Machine
TLPS: Thread-Level Parallelism Support
TLS: Thread-Local Storage
…used with variable sc, or sc > 255. For immediate versions (constant sc < 256), tschedulei is more efficient.
- tread res, off: reads the 64b value stored at the offset off (in a register) of the frame of the same thread. This is the general form of TREAD (see also TREADI), with a variable or >256 offset.
- treadi res, im: reads the 64b value stored at the im immediate offset of the frame of the self thread. This is the immediate form, with im < 256. For im > 255, or a variable, use the general form TREAD.
- twrite tloc, tval: writes the 64b value from register tval to the location stored in register tloc. This is the general form of TWRITE (see also TWRITEI), with variable frame locations. The tloc register packs a thread handle tid and an offset off, so that tloc = tid + off; tid is the return value of the tschedule instruction and its variants, and is guaranteed to have the 32 least significant bits set to 0. Hence tid and off can be used to construct the thread frame location by adding the values, or doing any other standard address arithmetic.
- twritei tid, tval, im: writes the 64b val
…uses the periodic AMD SimNow timer callback (FDU call_periodic). The FDU itself is, similar to the TSU, implemented as a singleton object. At each periodic call, the monitoring subsystem generates heartbeats and pushes them into the FDU's monitoring queue. After all cores have pushed their heartbeats, the FDU singleton processes them in its M(onitor) A(nalyse) P(lan) E(xecute) cycle, stores the information in its knowledge base, and updates its core records. For more details on the MAPE cycle and the FDU internals, please refer to Deliverables D5.1, D5.2 and D6.2. The current D-FDU implementation maintains interfaces to two TERAFLUX device types: (1) the cores within a node; (2) the TSU (used version: tsu2). The FDU-core interface implements a FIFO queue (message queue) shared between the cores and the FDU. The FDU-TSU interface is a function (get_core_record) exposed by the FDU singleton. This function is called by the node's TSU and returns the latest core record for a given core ID, enclosing information about the current core performance, its reliability value and the wear-out of the core. Whenever the TSU tries to schedule a new thread, it queries the FDU for a new core record and, if required, adjusts its scheduling policy.

2.10.2 Double Execution and Recovery Support

Double Execution and Thread Restart Recovery both require an execution free from side effects. To ensure this in the TERAFLUX simulator, we extended the dthread data st
