Home

AMD Accelerated Parallel Processing OpenCL User Guide


Contents

6.4 Device-side enqueue ..................................................... 6-8
    Data-dependent refinement ............................................... 6-8
    Extracting primes from an array by using device-side enqueue ............ 6-9
    Binary search using device enqueue (kernel enqueue) ..................... 6-15
6.5 Atomics and synchronization ............................................. 6-18
    6.5.1 ................................................................... 6-18
    6.5.2 ................................................................... 6-18
    Atomic Compare and Exchange (CAS) ....................................... 6-20
    Atomic Fetch ............................................................ 6-20
6.6 Pipes ................................................................... 6-21
    6.6.1 Overview .......................................................... 6-21
    6.6.2 Functions for accessing pipes ..................................... 6-22
6.7 ......................................................................... 6-24
6.8 Program-scope global variables .......................................... 6-24
    6.8.1 Overview .......................................................... 6-24
6.9 Image Enhancements ...................................................... 6-24
    6.9.1 ................................................................... 6-24
    6.9.2 ................................................................... 6-25
    6.9.3 Depth images ...................................................... 6-27
6.10 Non-uniform work-group size ............................................ 6-28
6.11 ........................................................................
    6.11.1 Migrating from
generate binary images offline (cl_amd_offline_devices), 6
GL context, associate with CL context, 1
global buffer, 9, 10
    source or destination for instruction, 10
    storing writes to random memory locations, 9
global memory (VRAM), 9
global scope, 4
global work-size
    defined, 4
    dividing into work-groups, 3
GLUT windowing system, 2
GNU project debugger (GDB)
    description, 2
    See GDB
GPU, 1
    binaries, 1
    communication between host and GPU, 8
    copying data from host to GPU, 6
    dataflow between host and GPU, 6
    parallel min() function code, 19
    parallelizing large numbers of work-items, 10
    predefined macros, 13
    processing, 2
        LLVM IR to CAL IL module, 2
        OpenCL runtime, 1
    programming, 2
    scheduling, 10
    specific macros, 1
    storing writes to random memory locations, 9
    structure, 6, 2
GPUs
    dedicating for specific operations, 10
    masking, 10
granularity
    branch, 4
    data-parallel, 2
H
hardware
pre-ICD code snippet, 1, 2
    using, 1
instruction
    branch, 4
    global buffer, 10
    kernel, 9
    sequence, stream cores, 3
instructions
    scalar and vector, 7
integer
    performing operations, 3, 4
Intel C (C++) compiler
    compiling OpenCL on Windows, 2
interoperability context
    code for creating, 3
interrelationship of memory domains, 6
K
kernel
    code, parallel min() function, 20
    code compilation, adding a defined macro with the name of the extension, 3
    code example, 13
    command queue elements, 9
    commands, 6
    compilation error, 16
    compiling, 1
    compute
        definition, 1
        strengths, 1
    creating within programs, 4
    debugging, 2
    definition of, 1
    device-specific binaries, 2
    distributing in OpenCL, 1
    executed as a function of multi-dimensional domains of indices, 3
    executing, 2
        using corresponding command queue, 4
    execution, device-specific operations, 4
    execution calls, command queue elements
Another example of a pre-ICD code snippet follows.

    status = clGetDeviceIDs(NULL, CL_DEVICE_TYPE_DEFAULT, 0, NULL, &numDevices);

The ICD-compliant version of the code snippet is:

    status = clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 0, NULL, &numDevices);

NOTE: It is recommended that the host code look at the platform vendor string when searching for the desired OpenCL platform, instead of using the platform name string. The platform name string might change, whereas the platform vendor string remains constant for a particular vendor's implementation.

B.2 Using ICD

Appendix B: The OpenCL Installable Client Driver (ICD)
Copyright 2013 Advanced Micro Devices, Inc. All rights reserved.
AMD ACCELERATED PARALLEL PROCESSING

C.1 Overview

Appendix C: OpenCL Binary Image Format (BIF) v2.0

OpenCL Binary Image Format (BIF) 2.0 is in the ELF format. BIF 2.0 allows the OpenCL binary to contain the OpenCL source program, the LLVM IR, and the executable. The BIF defines the following special sections:
- .source: for storing the OpenCL source program
- .llvmir: for storing the Ope
hello world sample, 2
kernel (continued)
    instructions over PCIe bus, 9
    keyword, 1
    loading, 2
    no breakpoint set, 2
    overloading, 2
    program, OpenGL, 2
    programming, enabling extensions, 3
    programs using extensions, 2
    running on compute unit, 2
    submitting for execution, 4
    synchronization points, 2
    work-item, 2
kernel and function overloading, 1
kernel commands, 6
kernel name construction, 2
kernels
    debugging, 2
kernels and shaders, 2
Khronos
    approved list of extensions, 3
    website, 14
L
L1 cache, 10
L2 cache, 10, 7
latency hiding in memory, 10
latency hiding, 4, 11
launching threads, 19
LDS
    description, 2
    gather/scatter model, 2
    size allocated to work-group, 2
    using local memory, 9
LDS model constraints, 2
    data sharing, 2
    memory accesses outside
driver layer
    issuing commands, 9
    translating commands, 9
E
element (work-item), 3
ELF
    .rodata, storing OpenCL runtime control data, 1
    .shstrtab, forming an ELF, 1
    .strtab, forming an ELF, 1
    .symtab, forming an ELF, 1
    .text, storing the executable, 1
    format, 1
    forming, 1
    header fields, 2
    special sections, BIF, 1
enforce ordering
    between or within queues, events, 5
    synchronizing a given event, 3
    within a single queue, command-queue barrier, 4
engine
    DMA, 9
enqueuing
    commands in OpenCL, 5
    multiple tasks, parallelism, 2
    native kernels, parallelism, 2
environment variable
    AMD_OCL_BUILD_OPTIONS, 5
    AMD_OCL_BUILD_OPTIONS_APPEND, 5
    setting to avoid source changes, 2
event
    commands, 6
    enforces ordering
        between queues, 5
        within queues, 5
    synchronizing, 3
events
    forced ordering between
- Kernel commands (for example, clEnqueueNDRangeKernel, etc.),
- Memory commands (for example, clEnqueueReadBuffer, etc.), and
- Event commands (for example, clEnqueueWaitForEvents, etc.).

As illustrated in Figure 3.2, the application can create multiple command queues (some in libraries, for different components of the application, etc.). These queues are muxed into one queue per device type. The figure shows command queues 1 and 3 merged into one CPU device queue (blue arrows); command queue 2 (and possibly others) is merged into the GPU device queue (red arrow). The device queue then schedules work onto the multiple compute resources present in the device. Here, K = kernel commands, M = memory commands, and E = event commands.

3.2.1 Running Code on Windows

The following steps ensure the execution of OpenCL applications on Windows.

1. The path to OpenCL.lib ($AMDAPPSDKROOT\lib\x86\) must be included in the PATH environment variable.

2. Generally, the path to the kernel file (Template_Kernel.cl) specified in the host program is relative to the executable. Unless an absolute path is specified, the kernel file must be in the same directory as the executable.

Figure 3.2: Runtime Processing Structure (programming-layer command queues are muxed into per-device command queues, which feed the CPU and GPU cores)

Chapter 3: Building and Running OpenCL Programs
work-group, 2
    work-item processing, 4
Windows
    calling convention, 7
    compiling
        Intel C (C++) compiler, 2
        OpenCL, 2
        Visual Studio 2008 Professional Edition, 2
    debugging OpenCL kernels, 4
    running code, 6
    settings for compiling OpenCL, 2
work-group
    allocation of LDS size, 2
    barriers, 4
    composed of wavefronts, 2
    concept relating to compute kernels, 2
    defined, 4
    dividing global work-size into sub-domains, 3
    dividing work-items, 3
    number of wavefronts in, 3
    performance, 2
    relationship to wavefront, 4
    relationship with wavefronts, 4
    specifying wavefronts, 3
work-item, 2
    reaching point (barrier) in the code, 2
    branch granularity, 4
    communicating
        through globally shared memory, 3
        through locally shared memory, 3
    creation, 4
    deactivation, 9
    dividing into work-groups, 3
    element, 3
    encountering barriers, 4
    executing the branch, 4
    ke
FunctionName string, 3
extensions (continued)
    kernel code compilation, adding defined macro, 3
    naming conventions, 1
    optional, 1
    provided by a specific vendor, 1
    provided collectively by multiple vendors, 1
    querying for a platform, 1
    querying in OpenCL, 1
    same name overrides, 2
    use in kernel programs, 2
F
-f[no-]bin-source
    AMD supplemental compiler option, 6
-f[no-]bin-amdil
    AMD supplemental compiler option, 6
    compiler option, 4
-f[no-]bin-exe
    AMD supplemental compiler option, 6
    compiler option, 4
-f[no-]bin-llvmir
    AMD supplemental compiler option, 6
    compiler option, 4
-f[no-]bin-source
    compiler option, 4
fetch unit
    loads, 9
    processing, 4, 10
    stores, 9
    streaming stores, 9
    transferring the work-item, 9
fetches, memory stalls, 11
floating point operations
    double-precision, 4
    single-precision, 4
flow control, 4
    branching, 4
    execution of a single instru
A similar operation is applied to other components of the vectors.

Built-in Function:
    uintn amd_sad(uintn src0, uintn src1, uintn src2)
Description:
    dst.s0 = src2.s0 + abs(src0.s0 - src1.s0);
    A similar operation is applied to other components of the vectors.

Built-in Function:
    uintn amd_bfm(uintn src0, uintn src1)
Description:
    dst.s0 = ((1 << (src0.s0 & 0x1f)) - 1) << (src1.s0 & 0x1f);
    A similar operation is applied to other components of the vectors.

Built-in Function:
    uintn amd_bfe(uintn src0, uintn src1, uintn src2)
Description:
    NOTE: The >> operator represents a logical right shift.
    offset = src1.s0 & 31;
    width = src2.s0 & 31;
    if width = 0
        dst.s0 = 0;
    else if (offset + width) < 32
        dst.s0 = (src0.s0 << (32 - offset - width)) >> (32 - width);
    else
        dst.s0 = src0.s0 >> offset;
    A similar operation is applied to other components of the vectors.

A-10  Appendix A: OpenCL Optional Extensions

Built-in Function:
    intn amd_bfe(intn src0, uintn src1, uintn src2)
Description:
    NOTE: The >> operator represents an arithmetic right shift.
    offset = src1.s0 & 31;
    width = src2.s0 & 31;
    if width = 0
        dst.s0 = 0;
    else if (offset + width) < 32
        dst.s0 = (src0.s0 << (32 - offset - width)) >> (32 - width);
    else
        dst.s0 = src0.s0 >> offset;
    A similar operation is applied to other components of the vectors.
DEVMODE deviceMode;
...
continue;

EnumDisplaySettings(dispDevice.DeviceName, ENUM_CURRENT_SETTINGS, &deviceMode);
xCoordinate = deviceMode.dmPosition.x;
yCoordinate = deviceMode.dmPosition.y;

WNDCLASS windowclass;
windowclass.style = CS_OWNDC;
windowclass.lpfnWndProc = WndProc;
windowclass.cbClsExtra = 0;
windowclass.cbWndExtra = 0;
windowclass.hInstance = NULL;
windowclass.hIcon = LoadIcon(NULL, IDI_APPLICATION);
windowclass.hCursor = LoadCursor(NULL, IDC_ARROW);
windowclass.hbrBackground = (HBRUSH)GetStockObject(BLACK_BRUSH);
windowclass.lpszMenuName = NULL;
windowclass.lpszClassName = reinterpret_cast<LPCSTR>("SimpleGL");
RegisterClass(&windowclass);

gHwnd = CreateWindow(reinterpret_cast<LPCSTR>("SimpleGL"),
                     reinterpret_cast<LPCSTR>("OpenGL Texture Renderer"),
                     WS_CAPTION | WS_POPUPWINDOW,
                     xCoordinate,
                     yCoordinate,
                     screenWidth,
                     screenHeight,
                     NULL,
                     NULL,
                     windowclass.hInstance,
                     NULL);

hDC = GetDC(gHwnd);
pfmt = ChoosePixelFormat(hDC, &pfd);
ret = SetPixelFormat(hDC, pfmt, &pfd);
hRC = wglCreateContext(hDC);
ret = wglMakeCurrent(hDC, hRC);

cl_context_properties properties[] =
{
    CL_CONTEXT_PLATFORM, (cl_context_properties) platform,
    CL_GL_CONTEXT_KHR,   (cl_context_properties) hRC,
    CL_WGL_HDC_KHR,      (cl_context_properties) hDC,
    0
};
exceptions, 6
executing
    branch, 4
    kernels, 2, 3
        using corresponding command queue, 4
    kernels for specific devices, OpenCL programming model, 3
    loop, 5
    non-graphic function, data-parallel programming model, 2
execution
    command queue, 9
    of a single instruction over all work-items, 2
    OpenCL applications, 6
    order, barriers, 4
    single stream core, 10
explicit copying of data, 6
extension
    cl_amd_popcnt, 7
    clCreateKernel, 2
extension function pointers, 3
extension functions
    NULL and non-NULL return values, 3
extension support by device
    for devices 1, 15
    for devices 2 and CPUs, 16
extensions, 2
    AMD vendor-specific, 4
    approved by Khronos Group, 1
    approved list from Khronos Group, 3
    character strings, 1
    cl_amd_device_attribute_query, 5
    cl_amd_event_callback
        registering event callbacks for states, 6
    cl_amd_fp64, 4
    cl_amd_media_ops, 7, 9
    cl_amd_printf, 12
    cl_ext, 4
    compiler set to ignore, 2
    device fission, 4
    disabling, 2
    enabling
CL_<enum_name>_<VendorName>

A.2 Querying Extensions for a Platform

To query supported extensions for the OpenCL platform, use the clGetPlatformInfo() function with the param_name parameter set to the enumerated value CL_PLATFORM_EXTENSIONS. This returns the extensions as a character string, with extension names separated by spaces. To find out if a specific extension is supported by this platform, search the returned string for the required substring.

AMD Accelerated Parallel Processing OpenCL Programming Guide

A.3 Querying Extensions for a Device

To get the list of devices to be queried for supported extensions, use one of the following:

- Query for available platforms using clGetPlatformIDs(). Select one, and query for a list of available devices with clGetDeviceIDs().
- For a specific device type, call clCreateContextFromType(), and query a list of devices by calling clGetContextInfo() with the param_name parameter set to the enumerated value CL_CONTEXT_DEVICES.

After the device list is retrieved, the extensions supported by each device can be queried with a clGetDeviceInfo() function call with the parameter param_name set to the enumerated value CL_DEVICE_EXTENSIONS. The extensions are returned in a char string, with extension names separated by a space. To see if
Specifies to the compiler not to optimize. This is equivalent to the OpenCL standard option cl-opt-disable.

- -fno-bin-source: Does not generate OpenCL source in the .source section. For more information, see Appendix C, "OpenCL Binary Image Format (BIF) v2.0."
- -fno-bin-llvmir: Does not generate LLVM IR in the .llvmir section. For more information, see Appendix C, "OpenCL Binary Image Format (BIF) v2.0."
- -fno-bin-amdil: Does not generate AMD IL in the .amdil section. For more information, see Appendix C, "OpenCL Binary Image Format (BIF) v2.0."
- -fno-bin-exe: Does not generate the executable (ISA) in the .text section. For more information, see Appendix C, "OpenCL Binary Image Format (BIF) v2.0."

To avoid source changes, there are two environment variables that can be used to change CL options during the runtime:

- AMD_OCL_BUILD_OPTIONS: Overrides the CL options specified in clBuildProgram().
- AMD_OCL_BUILD_OPTIONS_APPEND: Appends options to the options specified in clBuildProgram().

A.8.6 cl_amd_offline_devices

To generate binary images offline, it is necessary to access the compiler for every device that the runtime supports, even if the device is currently not installed on the system. When, during context creation, CL_CONTEXT_OFFLINE_DEVICES_AMD is passed in the context properties, all supported devices, whether online or offline, are r
6.9.3 Depth images

As with other image formats, clCreateImage is used for creating depth image objects. However, the channel order must be set to CL_DEPTH, as illustrated below. For the data type of a depth image, OpenCL 2.0 supports only CL_FLOAT and CL_UNORM_INT16.

    cl_image_format imageFormat;
    imageFormat.image_channel_data_type = CL_UNORM_INT16;
    imageFormat.image_channel_order = CL_DEPTH;

    cl_mem imageObj = clCreateImage(
        context,                                   // A valid OpenCL context
        CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
        &imageFormat,
        &desc,                                     // cl_image_desc
        pSrcImage,                                 // A pointer to the image data
        &retErr);                                  // Returned error code

In OpenCL 2.0, depth images must be of type image2d or image2d_array; clCreateImage will fail for other dimensions when creating a depth image. A depth image object can be read by using the read_imagef call in the kernel. For writes, write_imagef must be used; read_imagei/read_imageui and write_imagei/write_imageui are not supported for depth images.

OpenCL 2.0 C introduces two data types, image2d_depth_t and image2d_array_depth_t, for declaring depth images. The following kernel code sample illustrates how to read depth image objects.

    // Read depth image object (input) based on sampler and offset, and
    // save it in results.
    kernel void sample_kernel(read_only image2d_depth_t input,
    float4 sum = (float4)(0);
    for (i = 0; i < filterDim.y; i++)
    {
        for (j = 0; j < filterDim.x; j++)
        {
            sum += convert_float4(src[(i * width) + j]) *
                   filter[(i * filterDim.x) + j];
        }
    }
    return sum;

Note: The compiler will try to resolve the address space at compile time. Otherwise, the runtime will decide whether the pointer references the local or the global address space. For optimum performance, the code must make it easy for the compiler to detect the pointer reference by avoiding data-dependent address space selection, so that run-time resolution, which is costly, is not required.

6.4 Device-side enqueue

6.4.1 Overview

In OpenCL 1.2, a kernel cannot be enqueued from a currently running kernel; enqueuing a kernel requires returning control to the host. OpenCL 2.0 introduces Clang blocks and new built-in functions that allow kernels to enqueue new work to device-side queues. In addition, the runtime API call clCreateCommandQueue has been deprecated in favor of a new call, clCreateCommandQueueWithProperties, which allows creating device-side command queues.

Device enqueue allows programmers to launch child kernels from the current (parent) kernel. This mechanism offers the following benefits:

1. As the overhead of returning control to the host to launch the k
Installable Client Driver (ICD), 1
AMD APP KernelAnalyzer, 1
AMD Core Math Library (ACML), 1
AMD GPU
    number of compute units, 8
AMD Radeon HD 68XX, 15
AMD Radeon HD 69XX, 15
AMD Radeon HD 75XX, 15
AMD Radeon HD 77XX, 15
AMD Radeon HD 78XX, 15
AMD Radeon HD 79XX series, 15
AMD Radeon HD 7XXX, 2
AMD Radeon R9 290X, 4, 8
AMD supplemental compiler, 6
    options, 6
AMD supplemental compiler option
    -f[no-]bin-source, 6
    -f[no-]bin-amdil, 6
    -f[no-]bin-exe, 6
    -f[no-]bin-llvmir, 6
AMD vendor-specific extensions, 4
amd_bitalign built-in function
amd_bytealign built-in function
amd_lerp built-in function
AMD_OCL_BUILD_OPTIONS environment variable
AMD_OCL_BUILD_OPTIONS_APPEND environment variable
amd_pack built-in function
amd_sad built-in function
amd_sad4 built-in function
amd_sadhi built-in function
amd_unpack0 built-in function
F.1.8 / F.1.9

Table F.1: New built-in functions in OpenCL 2.0 (continued)

work_group_scan_exclusive_min — Min exclusive scan across work-group
work_group_scan_inclusive_add — Sum inclusive scan across work-group
work_group_scan_inclusive_max — Max inclusive scan across work-group
work_group_scan_inclusive_min — Min inclusive scan across work-group

Pipe functions:
read_pipe — Read from pipe
write_pipe — Write to pipe
reserve_read_pipe — Reserve reads from pipe
reserve_write_pipe — Reserve writes to pipe
commit_read_pipe — Commit reserved pipe reads
commit_write_pipe — Commit reserved pipe writes
is_valid_reserve_id — Test reservation value
work_group_reserve_read_pipe — Work-group read reservation
work_group_reserve_write_pipe — Work-group write reservation
work_group_commit_read_pipe — Work-group commit read reservation
work_group_commit_write_pipe — Work-group commit write reservation
get_pipe_num_packets — Get current number of packets in pipe
get_pipe_max_packets — Get capacity of pipe

Enqueueing kernels:
enqueue_kernel — Enqueue block as kernel
get_kernel_work_group_size — Query max work-group size
get_kernel_preferred_work_group_size_multiple — Query preferred divisor of work-group size
enqueue_marker — Enqueue a marker
retain_event — Incremen
release_event
create_user_event
is_valid_event
set_user_event_status
capture_event_profiling_info
get_default_queue
ndrange_1D
ndrange_2D
ndrange_3D
        NULL);

    // 3. Create a context and command queue on that device.
    cl_context context = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    cl_command_queue queue = clCreateCommandQueue(context, device, 0, NULL);

    // 4. Perform runtime source compilation, and obtain kernel entry point.
    cl_program program = clCreateProgramWithSource(context, 1, &source, NULL, NULL);
    clBuildProgram(program, 1, &device, NULL, NULL, NULL);
    cl_kernel kernel = clCreateKernel(program, "memset", NULL);

    // 5. Create a data buffer.
    cl_mem buffer = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
                                   NWITEMS * sizeof(cl_uint), NULL, NULL);

    // 6. Launch the kernel. Let OpenCL pick the local work size.
    size_t global_work_size = NWITEMS;
    clSetKernelArg(kernel, 0, sizeof(buffer), (void *) &buffer);
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_work_size,
                           NULL, 0, NULL, NULL);
    clFinish(queue);

    // 7. Look at the results via synchronous buffer map.
    cl_uint *ptr;
    ptr = (cl_uint *) clEnqueueMapBuffer(queue, buffer, CL_TRUE, CL_MAP_READ,
                                         0, NWITEMS * sizeof(cl_uint),
                                         0, NULL, NULL, NULL);
    int i;
    for (i = 0; i < NWITEMS; i++)
        printf("%d %d\n", i, ptr[i]);

    return 0;
}

1.6.2 Example: SAXPY Function

This section provides an introductory s
While a basic understanding of GPU architectures is useful, this document does not assume prior graphics knowledge. It further assumes an understanding of chapters 1, 2, and 3 of the OpenCL Specification (for the latest version, see http://www.khronos.org/registry/cl/).

This AMD Accelerated Parallel Processing document begins, in Chapter 1, with an overview of the AMD Accelerated Parallel Processing programming models, OpenCL, and the AMD Compute Abstraction Layer (CAL). Chapter 2 discusses the AMD implementation of OpenCL. Chapter 3 discusses the compiling and running of OpenCL programs. Chapter 4 describes using the AMD CodeXL GPU Debugger and the GNU debugger (GDB) to debug OpenCL programs. Chapter 5 provides information about the extension that defines the OpenCL Static C++ kernel language, which is a form of the ISO/IEC "Programming languages C++" specification. Appendix A describes the supported optional OpenCL extensions. Appendix B details the installable client driver (ICD) for OpenCL. Appendix C describes the OpenCL binary image format (BIF). Appendix D provides a hardware overview of pre-GCN devices. Appendix E describes the interoperability between OpenCL and OpenGL. Appendix F describes the new and deprecated functions in OpenCL 2.0. The last section of this book is an index.

Preface
    if (!clGetGLContextInfoKHR)
    {
        clGetGLContextInfoKHR = (clGetGLContextInfoKHR_fn)
            clGetExtensionFunctionAddress("clGetGLContextInfoKHR");
    }

    size_t deviceSize = 0;
    status = clGetGLContextInfoKHR(properties,
                                   CL_CURRENT_DEVICE_FOR_GL_CONTEXT_KHR,
                                   0,
                                   NULL,
                                   &deviceSize);

    if (deviceSize == 0)
    {
        // No interopable CL device found; cleanup.
        wglMakeCurrent(NULL, NULL);
        wglDeleteContext(hRC);
        DeleteDC(hDC);
        hDC = NULL;
        hRC = NULL;
        DestroyWindow(gHwnd);
        // Try the next display.
        continue;
    }

    ShowWindow(gHwnd, SW_SHOW);
    // Found a winner.
    break;

    cl_context_properties properties[] =
    {
        CL_CONTEXT_PLATFORM, (cl_context_properties) platform,
        CL_GL_CONTEXT_KHR,   (cl_context_properties) hRC,
        CL_WGL_HDC_KHR,      (cl_context_properties) hDC,
        0
    };

    status = clGetGLContextInfoKHR(properties,
                                   CL_CURRENT_DEVICE_FOR_GL_CONTEXT_KHR,
                                   sizeof(cl_device_id),
                                   &interopDevice,
                                   NULL);

    // Create OpenCL context from device's id.
    context = clCreateContext(properties, 1, &interopDevice, 0, 0, &status);

Limitations:
- It is recommended not to use GLUT in a multi-GPU environment.

E.1.2 Linux Operating System

Single GPU Environment

Creating CL Context
    PIXELFORMATDESCRIPTOR pfd;
    ZeroMemory(&pfd, sizeof(PIXELFORMATDESCRIPTOR));
    pfd.nSize           = sizeof(PIXELFORMATDESCRIPTOR);
    pfd.nVersion        = 1;
    pfd.dwFlags         = PFD_DRAW_TO_WINDOW | PFD_SUPPORT_OPENGL |
                          PFD_DOUBLEBUFFER;
    pfd.iPixelType      = PFD_TYPE_RGBA;
    pfd.cColorBits      = 24;
    pfd.cRedBits        = 8;
    pfd.cRedShift       = 0;
    pfd.cGreenBits      = 8;
    pfd.cGreenShift     = 0;
    pfd.cBlueBits       = 8;
    pfd.cBlueShift      = 0;
    pfd.cAlphaBits      = 0;
    pfd.cAlphaShift     = 0;
    pfd.cAccumBits      = 0;
    pfd.cAccumRedBits   = 0;
    pfd.cAccumGreenBits = 0;
    pfd.cAccumBlueBits  = 0;
    pfd.cAccumAlphaBits = 0;
    pfd.cDepthBits      = 24;
    pfd.cStencilBits    = 8;
    pfd.cAuxBuffers     = 0;
    pfd.iLayerType      = PFD_MAIN_PLANE;
    pfd.bReserved       = 0;
    pfd.dwLayerMask     = 0;
    pfd.dwVisibleMask   = 0;
    pfd.dwDamageMask    = 0;

    WNDCLASS windowclass;
    windowclass.style = CS_OWNDC;
    windowclass.lpfnWndProc = WndProc;
    windowclass.cbClsExtra = 0;
    windowclass.cbWndExtra = 0;

E.1 Under Windows

    windowclass.hInstance = NULL;
    windowclass.hIcon = LoadIcon(NULL, IDI_APPLICATION);
    windowclass.hCursor = LoadCursor(NULL, IDC_ARROW);
    windowclass.hbrBackground = (HBRUSH)GetStockObject(BLACK_BRUSH);
    windowclass.lpszMenuName = NULL;
    windowclass.lpszClassName = reinterpret_cast<LPCSTR>("SimpleGL");
    RegisterClass(&windowclass);
4.2.3 Sample GDB Session

The following is a sample debugging session. Note that two separate breakpoints are set. The first is set in the host code, at clEnqueueNDRangeKernel. The second breakpoint is set at the actual CL kernel function.

$ export AMD_OCL_BUILD_OPTIONS_APPEND="-g -O0"
$ export CPU_MAX_COMPUTE_UNITS=1
$ gdb BitonicSort
GNU gdb (GDB) 7.1-ubuntu
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>
Reading symbols from /home/himanshu/Desktop/ati-stream-sdk-v2.3-lnx64/samples/opencl/bin/x86_64/BitonicSort...done.
(gdb) b clEnqueueNDRangeKernel
Breakpoint 1 at 0x403228
(gdb) r --device cpu
Starting program: /home/himanshu/Desktop/ati-stream-sdk-v2.3-lnx64/samples/opencl/bin/x86_64/BitonicSort --device cpu
[Thread debugging using libthread_db enabled]
Unsorted Input
53 5 199 15 120 9 71 107 71 242 84 150 134 180 26 128 196 9 98 4 102 65
206 35 224 2 52 172 160 94 2 214 99
Platform Vendor : Advanced Micro Devices, Inc.
Device 0 : AMD Athlon(tm) II X4 630 Processor
[New
    4096 * 4096;

    // 1. Quick & dirty MWC random init of source buffer.
    // Random seed (portable).
    time_t ltime;
    time(&ltime);

    src_ptr = (cl_uint *) malloc(num_src_items * sizeof(cl_uint));

    cl_uint a = (cl_uint) ltime,
            b = (cl_uint) ltime;
    cl_uint min = (cl_uint) -1;

    // Do serial computation of min() for result verification.
    for (int i = 0; i < num_src_items; i++)
    {
        src_ptr[i] = (cl_uint) (b = (a * (b & 65535)) + (b >> 16));
        min = src_ptr[i] < min ? src_ptr[i] : min;
    }

    // Get a platform.
    clGetPlatformIDs(1, &platform, NULL);

    // 3. Iterate over devices.
    for (dev = 0; dev < NDEVS; dev++)
    {
        cl_device_id     device;
        cl_context       context;
        cl_command_queue queue;
        cl_program       program;
        cl_kernel        minp;
        cl_kernel        reduce;
        cl_mem           src_buf;
        cl_mem           dst_buf;
        cl_mem           dbg_buf;
        cl_uint          *dst_ptr, *dbg_ptr;

        printf("\n%s: ", dev == 0 ? "CPU" : "GPU");

        // Find the device.
        clGetDeviceIDs(platform, devs[dev], 1, &device, NULL);

        // 4. Compute work sizes.
        cl_uint compute_units;
        size_t  global_work_size;
        size_t  local_work_size;
        size_t  num_groups;

        clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                        sizeof(cl_uint), &compute_units, NULL);

        if (devs[dev] == CL_DEVICE_TYPE_CPU)
        {
            global_work_size = co
BONAIRE — GPU ISA
CAL_TARGET_KALINDI = 29 — KALINDI GPU ISA

The BIF can be either a 32-bit ELF format or a 64-bit ELF format. For the GPU, OpenCL generates a 32-bit BIF binary; it can read either a 32-bit BIF or a 64-bit BIF binary. For the CPU, OpenCL generates and reads only 32-bit BIF binaries if the host application is 32-bit (on either a 32-bit or a 64-bit OS). It generates and reads only 64-bit BIF binaries if the host application is 64-bit (on a 64-bit OS).

C.2 BIF Options

OpenCL provides the following options to control what is contained in the binary:

- -f[no-]bin-source — [Do not] generate OpenCL source in the .source section.
- -f[no-]bin-llvmir — [Do not] generate LLVM IR in the .llvmir section.
- -f[no-]bin-exe — [Do not] generate the executable (ISA) in the .text section.

The option syntax follows the GCC option syntax. By default, OpenCL generates the .llvmir section, .amdil section, and .text section.

The following are examples for using these options.

Example 1: Generate executable for execution:

    clBuildProgram(program, 0, NULL, "-fno-bin-llvmir -fno-bin-amdil", NULL, NULL);

Example 2: Generate only LLVM IR:

    clBuildProgram(program, 0, NULL, "-fno-bin-exe -fno-bin-amdil", NULL, NULL);

This binary can be recompiled for all the other devices of the same device type.

Appendix C: OpenCL Binary Image Format (BIF) v2.0
NJ, 1978.

I. Buck, T. Foley, D. Horn, J. Sugerman, K. Fatahalian, M. Houston, and P. Hanrahan, "Brook for GPUs: stream computing on graphics hardware," ACM Trans. Graph., vol. 23, no. 3, pp. 777-786, 2004.

AMD Compute Abstraction Layer (CAL) Intermediate Language (IL) Reference Manual. Published by AMD.

Buck, Ian; Foley, Tim; Horn, Daniel; Sugerman, Jeremy; Hanrahan, Pat; Houston, Mike; Fatahalian, Kayvon. "BrookGPU," http://graphics.stanford.edu/projects/brookgpu/

Buck, Ian. "Brook Spec v0.2," October 31, 2003, http://merrimac.stanford.edu/brook/brookspec-05-20-03.pdf

OpenGL Programming Guide, at http://www.glprogramming.com/red/

Microsoft DirectX Reference Website, at http://msdn.microsoft.com/en-us/directx

GPGPU: http://www.gpgpu.org, and the Stanford BrookGPU discussion forum, http://www.gpgpu.org/forums

Contact Information:
URL: developer.amd.com/appsdk
Developing: developer.amd.com
Forum: developer.amd.com/openclforum

Contents

Chapter 1: OpenCL Architecture and AMD Accelerated Parallel Processing
This method of memory latency hiding helps the GPU compute device achieve maximum performance. If none of T0-T3 are runnable, the compute unit waits (stalls) until one of T0-T3 is ready to execute. In the example shown in Figure 2.7, T0 is the first to continue execution.

Figure 2.7: Compute Unit Stall Due to Data Dependency (wavefronts over time, 0-80; legend: executing / ready, not executing / stalled)

Chapter 2: AMD Implementation

Chapter 3: Building and Running OpenCL Programs

The compiler tool chain provides a common framework for both CPUs and GPUs, sharing the front end and some high-level compiler transformations. The back ends are optimized for the device type (CPU or GPU). Figure 3.1 is a high-level diagram showing the general compilation path of applications using OpenCL. Functions of an application that benefit from acceleration are re-written in OpenCL and become the OpenCL source. The code calling these functions is changed to use the OpenCL API. The rest of the application remains unchanged. The kernels are compiled by the OpenCL compiler to either CPU binaries or GPU binaries, depending on the target device.

(Figure 3.1 labels: OpenCL Compiler, Built-In Librar
and create an executable. For linking to a 64-bit library:

    g++ -o Template Template.o -lOpenCL -L$AMDAPPSDKROOT/lib/x86_64

For linking to a 32-bit library:

    g++ -o Template Template.o -lOpenCL -L$AMDAPPSDKROOT/lib/x86

The OpenCL samples in the SDK provided by AMD Accelerated Parallel Processing depend on the SDKUtil library. In Linux, the samples use the shipped SDKUtil.lib, whether or not the sample is built for release or debug. When compiling all samples from the samples/opencl folder, the SDKUtil.lib is created first; then, the samples use this generated library. When compiling the SDKUtil library, the created library replaces the shipped library.

The following are linking options if the samples depend on the SDKUtil library (assuming the SDKUtil library is created in $AMDAPPSDKROOT/lib/x86_64 for 64-bit libraries, or $AMDAPPSDKROOT/lib/x86 for 32-bit libraries):

    g++ -o Template Template.o -lSDKUtil -lOpenCL -L$AMDAPPSDKROOT/lib/x86_64

    g++ -o Template Template.o -lSDKUtil -lOpenCL -L$AMDAPPSDKROOT/lib/x86

3.1 Compiling the Program

3.1.3 Supported Standard OpenCL Compiler Options

The currently supported options are:

-I dir — Add the directory dir to the list of directories to be searched for header files. When parsing #include directives, the OpenCL compiler resolves relative paths us
and initialized with a random pattern. Also, the actual min value for this data set is serially computed, in order to later verify the parallel result.

2. The compiler is instructed to dump the intermediate IL and ISA files for further analysis.

3. The main section of the code, including device setup, CL data buffer creation, and code compilation, is executed for each device — in this case, for CPU and GPU. Since the source memory buffer exists on the host, it is shared. All other resources are device-specific.

4. The global work size is computed for each device. A simple heuristic is used to ensure an optimal number of threads on each device. For the CPU, a given CL implementation can translate one work-item per CL compute unit into one thread per CPU core. On the GPU, an initial multiple of the wavefront size is used, which is adjusted to ensure even divisibility of the input data over all threads. The value of 7 is a minimum value to keep all independent hardware units of the compute units busy, and to provide a minimum amount of memory latency hiding for a kernel with little ALU activity.

5. After the kernels are built, the code prints errors that occurred during kernel compilation and linking.

6. The main loop is set up so that the measured timing reflects the actual kernel performance. If a sufficient
arrays, the overflow data is placed (spilled) into scratch memory. Scratch memory is a private subset of global memory, so performance can be dramatically degraded if spilling occurs.

Global memory can be in the high-speed GPU memory (VRAM) or in the host memory, which is accessed by the PCIe bus. A work-item can access global memory either as a buffer or a memory object. Buffer objects are generally read and written directly by the work-items. Data is accessed through the L2 and L1 data caches on the GPU. This limited form of caching provides read coalescing among work-items in a wavefront. Similarly, writes are executed through the texture L2 cache.

Global atomic operations are executed through the texture L2 cache. Atomic instructions that return a value to the kernel are handled similarly to fetch instructions: the kernel must use S_WAITCNT to ensure the results have been written to the destination GPR before using the data.

1.5.3 Memory Access

Using local memory (known as local data store, or LDS, as shown in Figure 1.2) typically is an order of magnitude faster than accessing host memory through global memory (VRAM), which is one order of magnitude faster again than PCIe. However, stream cores do not directly access memory; instead, they issue memory requests through dedicated hardware units. When a work-item tries to access memory, the work-item is transferred to the appropriate fetch unit. The work-item then is deactivated until the acce
->setX(5);

Example Host Code:

    class Test
    {
    public:
        void setX(int value);
    private:
        int x;
    };

    Test* tempClass = new(Test);
    // Some OpenCL startup code — create context, queue, etc.
    cl_mem classObj = clCreateBuffer(context, CL_MEM_USE_HOST_PTR,
                                     sizeof(Test), &tempClass, event);
    clEnqueueMapBuffer(..., classObj, ...);
    tempClass->setX(10);
    clEnqueueUnmapBuffer(..., classObj, ...); // class is passed to the Device
    clEnqueueNDRange(..., fooKernel, ...);
    clEnqueueMapBuffer(..., classObj, ...);   // class is passed back to the Host

5.4 Kernel Overloading

This example shows how to define and use mangled names for kernel overloading, and how to choose the right kernel from the host code. Assume the following kernels are defined:

    __attribute__((mangled_name(testAddFloat4)))
    kernel void testAdd(global float4 *src1, global float4 *src2, global float4 *dst)
    {
        int tid = get_global_id(0);
        dst[tid] = src1[tid] + src2[tid];
    }

    __attribute__((mangled_name(testAddInt8)))
    kernel void testAdd(global int8 *src1, global int8 *src2, global int8 *dst)
    {
        int tid = get_global_id(0);
        dst[tid] = src1[tid] + src2[tid];
    }

The names testAddFloat4 and testAddInt8 are the external names for the two kernel instances. When calling clCreateKernel, passing one of the
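A minimal host-side sketch of selecting one overload by its mangled external name (context and program setup elided; the variable names are hypothetical):

```
cl_int err;
// Pick the float4 variant of the overloaded testAdd kernel by its
// external (mangled) name.
cl_kernel addFloat4 = clCreateKernel(program, "testAddFloat4", &err);
// ... set arguments and enqueue as usual ...
```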
operations are processed (where supported) by connecting two or four of the ALUs (excluding the transcendental core) to perform a single double-precision operation. The processing element also contains one branch execution unit to handle branch instructions.

Different GPU compute devices have different numbers of processing elements. For example, the ATI Radeon HD 5870 GPU has 20 compute units, each with 16 stream cores, and each stream core contains five ALUs; this yields 1600 physical ALUs.

Appendix E — OpenCL-OpenGL Interoperability

This chapter explains how to establish an association between a GL context and a CL context. Please note the following guidelines.

1. All devices used to create the OpenCL context associated with the command queue must support acquiring shared CL/GL objects. This constraint is enforced at context-creation time: clCreateContext and clCreateContextFromType fail context creation if the device list passed in cannot interoperate with the GL context. clCreateContext only permits GL-friendly device(s); clCreateContextFromType can only include GL-friendly device(s).

2. Use clGetGLContextInfoKHR to determine GL-friendly device(s) from the following parameters:
a. CL_CURRENT_DEVICE_FOR_GL_CONTEXT_KHR — only returns the device that can interoperate w
which writes into the pipe, and another kernel (the consumer kernel), which reads from the pipe. In this example, the producer writes a sequence of random numbers; the consumer reads these numbers and creates a histogram of them.

The producer kernel first reserves the amount of space in the pipe by invoking:

    rid = work_group_reserve_write_pipe(rng_pipe, szgr);

Next, it writes and commits into the pipe by invoking:

    write_pipe(rng_pipe, rid, lid, &gfrn);
    work_group_commit_write_pipe(rng_pipe, rid);

Similarly, the consumer kernel reserves the pipe for reading and reads from the pipe by issuing:

    rid = work_group_reserve_read_pipe(rng_pipe, szgr);
    read_pipe(rng_pipe, rid, lid, &rn);
    work_group_commit_read_pipe(rng_pipe, rid);

The pipe thus provides a useful communication mechanism between the two kernels.

6.7 Sub-groups

6.7.1 Overview

OpenCL 2.0 introduces a Khronos sub-group extension. Sub-groups are a logical abstraction of the hardware SIMD execution model, akin to wavefronts, warps, or vectors, and permit programming closer to the hardware in a vendor-independent manner. This extension includes a set of cross-sub-group built-in functions that match the set of the cross-work-group built-in functions specified in Section 6.4.

6.8 Program-scope global Variables

6.8.1 Overview

Op
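The calls above can be assembled into a condensed OpenCL C sketch. The names rng_pipe, szgr, gfrn, and rn follow the text; the random-number generation and the histogram update are elided, and the is_valid_reserve_id checks are an addition for robustness:

```
kernel void producer(write_only pipe uint rng_pipe) {
    int lid = get_local_id(0);
    uint szgr = get_local_size(0);
    uint gfrn = 0; /* random-number generation elided */
    // Reserve space for the whole work-group, write one packet per
    // work-item, then commit the reservation.
    reserve_id_t rid = work_group_reserve_write_pipe(rng_pipe, szgr);
    if (is_valid_reserve_id(rid)) {
        write_pipe(rng_pipe, rid, lid, &gfrn);
        work_group_commit_write_pipe(rng_pipe, rid);
    }
}

kernel void consumer(read_only pipe uint rng_pipe, global uint *hist) {
    int lid = get_local_id(0);
    uint szgr = get_local_size(0);
    uint rn;
    reserve_id_t rid = work_group_reserve_read_pipe(rng_pipe, szgr);
    if (is_valid_reserve_id(rid)) {
        read_pipe(rng_pipe, rid, lid, &rn);
        work_group_commit_read_pipe(rng_pipe, rid);
        /* histogram update of hist[] elided */
    }
}
```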
    for (int n = 0; n < count; n++, idx += stride)
    {
        pmin = min(pmin, src[idx].x);
        pmin = min(pmin, src[idx].y);
        pmin = min(pmin, src[idx].z);
        pmin = min(pmin, src[idx].w);
    }

    // 12. Reduce min values inside work-group.

    if (get_local_id(0) == 0)
        lmin[0] = (uint)-1;

    barrier(CLK_LOCAL_MEM_FENCE);

    (void) atom_min(lmin, pmin);

    barrier(CLK_LOCAL_MEM_FENCE);

    // Write out to __global.

    if (get_local_id(0) == 0)
        gmin[get_group_id(0)] = lmin[0];

    // Dump some debug information.

    if (get_global_id(0) == 0)
    {
        dbg[0] = get_num_groups(0);
        dbg[1] = get_global_size(0);
        dbg[2] = count;
        dbg[3] = stride;
    }
}

// 13. Reduce work-group min values from __global to __global.
kernel void reduce(__global uint4 *src,
                   __global uint  *gmin)
{
    (void) atom_min(gmin, gmin[get_global_id(0)]);
}

int main(int argc, char **argv)
{
    cl_platform_id platform;
    int            dev, nw;
    cl_device_type devs[NDEVS] = { CL_DEVICE_TYPE_CPU,
                                   CL_DEVICE_TYPE_GPU };
    cl_uint        *src_ptr;
    unsigned int   num_src_items
[Figure 2.5: block diagram of a GCN compute device — 32 compute units, each pairing one Scalar Unit and four Vector Units with an L1 cache and LDS, backed by a shared Level 2 cache and a GDDR5 memory system.]

In Figure 2.5, there are two command processors, which can process two command queues concurrently. The Scalar Unit, Vector Unit, Level 1 data cache (L1), and Local Data Share (LDS) are the components of one compute unit, of which there are 32. The scalar (SC) cache is the scalar unit data cache, and the Level 2 cache holds both instructions and data.

On GCN devices, the instruction stream contains both scalar and vector instructions. On each cycle, it selects a scalar instruction and a vector instruction (as well as a memory operation and a branch operation, if available); it issues one to the scalar unit, the other to the vector unit. This takes four cycles to issue over the four vector cores — the same four cycles over which the 16 un
concurrent accesses to the same pipe by multiple kernels (even if permitted by hardware) is undefined. A pipe cannot be accessed from the host side; pipes are created on the host with a call to clCreatePipe and may be passed between kernels. Pipes may be particularly useful when combined with device-side enqueue for dynamically constructing computational data flow graphs.

There are two types of pipes: a read pipe, from which a number of packets can be read; and a write pipe, to which a number of packets can be written.

Note: A pipe specified as read-only cannot be written into, and a pipe specified as write-only cannot be read from. A pipe cannot be read from and written into at the same time.

6.6.2 Functions for accessing pipes

A new host API function has been added into the OpenCL 2.0 spec to create the pipe:

    cl_mem clCreatePipe(cl_context context,
                        cl_mem_flags flags,
                        cl_uint pipe_packet_size,
                        cl_uint pipe_max_packets,
                        const cl_pipe_properties *properties,
                        cl_int *errcode_ret)

The memory allocated in the above function can be passed to kernels as read-only or write-only pipes. Also, a set of built-in functions have been added to operate on the pipes. Read pipe and write pipe functions:

    int read_pipe(pipe gentype p, gentype *ptr)

Read a packet from pipe p into ptr. Returns 0 if read_pipe is successful, and a negative value otherwise.

    int write_pipe(pipe gentype p, const
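A host-side sketch of creating a pipe and handing it to two kernels (the packet type, capacity, and variable names are illustrative assumptions; error handling elided):

```
cl_int err;
// A pipe of 4096 uint packets. Pipes are created with CL_MEM_READ_WRITE
// and CL_MEM_HOST_NO_ACCESS (host access to pipes is not permitted).
cl_mem rng_pipe = clCreatePipe(context,
                               CL_MEM_READ_WRITE | CL_MEM_HOST_NO_ACCESS,
                               sizeof(cl_uint), 4096, NULL, &err);
// The same cl_mem is then set as an argument of both kernels:
clSetKernelArg(producer_kernel, 0, sizeof(cl_mem), &rng_pipe);
clSetKernelArg(consumer_kernel, 0, sizeof(cl_mem), &rng_pipe);
```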
Conventions

The following conventions are used in this document:

mono-spaced font — A filename, file path, or code.
* — Any number of alphanumeric characters in the name of a code format, parameter, or instruction.
[1,2) — A range that includes the left-most value (in this case, 1), but excludes the right-most value (in this case, 2).
[1,2] — A range that includes both the left-most and right-most values (in this case, 1 and 2).
{x | y} — One of the multiple options listed; in this case, x or y.
0.0f — A single-precision (32-bit) floating-point value.
0.0 — A double-precision (64-bit) floating-point value.
1011b — A binary value; in this example, a 4-bit value.
7:4 — A bit range, from bit 7 to 4, inclusive. The high-order bit is shown first.
italicized word or phrase — The first use of a term or concept basic to the understanding of stream computing.

Related Documents

- The OpenCL Specification, Version 1.1. Published by Khronos OpenCL Working Group, Aaftab Munshi (ed.), 2010.
- The OpenCL Specification, Version 2.0. Published by Khronos OpenCL Working Group, Aaftab Munshi (ed.), 2013.
- AMD, R600 Technology, R600 Instruction Set Architecture, Sunnyvale, CA, est. pub. date 2007. This document includes the RV670 GPU instruction details.
- ISO/IEC 9899:TC2 — International Standard — Programming Languages — C.
- Kernighan, Brian W., and Ritchie, Dennis M., The C Programming Language, Prentice-Hall, Inc., Upper Saddle River
INT16; CL_R / CL_SIGNED_INT32; CL_R / CL_UNSIGNED_INT32; CL_R / CL_FLOAT; CL_RG / CL_SIGNED_INT32; CL_RG / CL_UNSIGNED_INT32; CL_RG / CL_FLOAT; CL_RGB / CL_SIGNED_INT32; CL_RGB / CL_UNSIGNED_INT32; CL_RGB / CL_FLOAT.

Appendix F — New and deprecated functions in OpenCL 2.0

F.1 New built-in functions in OpenCL 2.0

F.1.1 Work-Item Functions
- get_enqueued_local_size — local sizes in the uniform part of the NDRange
- get_global_linear_id — unique 1D index for each work-item in the NDRange
- get_local_linear_id — unique 1D index for each work-item in the work-group

F.1.2 Integer functions
- ctz — count trailing zero bits

F.1.3 Synchronization Functions
- work_group_barrier — replaces barrier; adds scope

F.1.4 Address space qualifier functions
- to_global — convert a generic pointer to a global pointer
- to_local — convert a generic pointer to a local pointer
- to_private — convert a generic pointer to a private pointer
- get_fence — get the fence appropriate to an address space

F.1.5 Atomic functions
- atomic_init
- atomic_work_item_fence
- atomic_store_explicit
- atomic_load_explicit
- atomic_exchange_explicit
Ini
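A small OpenCL C sketch (illustrative only) exercising some of the 2.0 built-ins listed above:

```
kernel void scale(global float *data, float k) {
    // 2.0 replacement for barrier(), with an explicit memory scope.
    work_group_barrier(CLK_LOCAL_MEM_FENCE, memory_scope_work_group);
    // Unique, linearized 1D index, even for a multi-dimensional NDRange.
    size_t i = get_global_linear_id();
    data[i] *= k;
}
```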
[Figure D.2: Simplified Block Diagram of an Evergreen-Family GPU — instruction and control flow feeding compute units; each stream core contains a branch execution unit, ALUs, and general-purpose registers.]

GPU compute devices comprise groups of compute units. Each compute unit contains numerous stream cores, which are responsible for executing kernels, each operating on an independent data stream. Stream cores, in turn, contain numerous processing elements, which are the fundamental, programmable ALUs that perform integer, single-precision floating-point, double-precision floating-point, and transcendental operations. All stream cores within a compute unit execute the same instruction sequence in lock-step (for Evergreen and Northern Islands devices); different compute units can execute different instructions.¹ (1. Much of this is transparent to the programmer.)

A stream core is arranged as a five-way or four-way (depending on the GPU type) very long instruction word (VLIW) processor (see bottom of Figure D.2). Up to five scalar operations (or four, depending on the GPU type) can be co-issued in a VLIW instruction, each of which is executed on one of the corresponding five ALUs. ALUs can execute single-precision floating-point or integer operations. One of the five ALUs also can perform transcendental operations (sine, cosine, logarithm, etc.). Double-precision floating-point
0xFF) << 24)

Built-in function: amd_unpack0
    floatn amd_unpack0(uintn src)
Return value for each vector component: (float)(src[i] & 0xFF)

Built-in function: amd_unpack1
    floatn amd_unpack1(uintn src)
Return value for each vector component: (float)((src[i] >> 8) & 0xFF)

Built-in function: amd_unpack2
    floatn amd_unpack2(uintn src)
Return value for each vector component: (float)((src[i] >> 16) & 0xFF)

Built-in function: amd_unpack3
    floatn amd_unpack3(uintn src)
Return value for each vector component: (float)((src[i] >> 24) & 0xFF)

Built-in function: amd_bitalign
    uintn amd_bitalign(uintn src0, uintn src1, uintn src2)
Return value for each vector component:
    (uint)((((long)src0[i] << 32) | (long)src1[i]) >> (src2[i] & 31))

Built-in function: amd_bytealign
    uintn amd_bytealign(uintn src0, uintn src1, uintn src2)
Return value for each vector component:
    (uint)((((long)src0[i] << 32) | (long)src1[i]) >> ((src2[i] & 3) * 8))

Built-in function: amd_lerp
    uintn amd_lerp(uintn src0, uintn src1, uintn src2)
Return value for each vect
(Table A.1, continued from the previous page.) The following extensions are reported as supported ("Yes") across all listed device families — AMD Radeon HD 79XX¹, 78XX², 77XX³, 75XX/76XX⁴, 69XX⁵, 68XX⁶, and ATI Radeon HD 59XX/58XX with AMD FirePro V88XX/V87XX⁷: cl_amd_vec3, Images, cl_khr_d3d10_sharing, cl_amd_media_ops, cl_amd_printf, cl_amd_popcnt, cl_khr_3d_image_writes. Platform extensions supported on all of the above: cl_khr_icd, cl_amd_event_callback, cl_amd_offline_devices.

1. AMD Radeon HD 79XX series. 2. AMD Radeon HD 78XX series. 3. AMD Radeon HD 77XX series. 4. AMD Radeon HD 75XX series and AMD Radeon HD 76XX series. 5. AMD Radeon HD 69XX series. 6. AMD Radeon HD 68XX series. 7. ATI Radeon HD 59XX series and 58XX series; AMD FirePro V88XX series and V87XX series.

Note that an atomic counter is a device-level counter that can be added/decremented by different work-items, where the atomicity of the operation is guaranteed. The access to the counter is done only through add/dec built-in functions; thus, no two work-items have the same value returned in the case that a given kernel only increments or decrements the counter. Also see http://www.khronos.org/registry/cl/extensions/ext/cl_ext_atomic_counters_32.txt
above. If the number of interoperable devices is zero, use glXDestroyContext to destroy the context created at step h, and go to step a; otherwise, exit from the loop (an OpenCL-OpenGL interoperable device has been found).

6. Use clGetGLContextInfoKHR to get the OpenCL-OpenGL interoperable device id.

7. Use clCreateContext to create the context on the interoperable device obtained in the previous step.

The following code segment shows how to create an OpenCL-OpenGL interoperability context on a system with multiple GPUs.

    displayName = XOpenDisplay(NULL);
    int screenNumber = ScreenCount(displayName);
    XCloseDisplay(displayName);

    for (int i = 0; i < screenNumber; i++)
    {
        if (isDeviceIdEnabled())
        {
            if (i < deviceId)
            {
                continue;
            }
        }
        char disp[100];
        sprintf(disp, "DISPLAY=:0.%d", i);
        putenv(disp);
        displayName = XOpenDisplay(0);
        int nelements;
        GLXFBConfig *fbc = glXChooseFBConfig(displayName,
                                             DefaultScreen(displayName),
                                             0, &nelements);
        static int attributeList[] = { GLX_RGBA,
                                       GLX_DOUBLEBUFFER,
                                       GLX_RED_SIZE,   1,
                                       GLX_GREEN_SIZE, 1,
                                       GLX_BLUE_SIZE,  1,
                                       None };
        XVisualInfo *vi = glXChooseVisual(displayName,
                                          DefaultScreen(displayName),
                                          attributeList);
        XSetWindowAttributes swa;
        swa.colo
an extension is present, search the string for a specified substring.

A.4 Using Extensions in Kernel Programs

There are special directives for the OpenCL compiler to enable or disable available extensions supported by the OpenCL implementation and, specifically, by the OpenCL compiler. The directive is defined as follows.

    #pragma OPENCL EXTENSION <extension_name> : <behavior>
    #pragma OPENCL EXTENSION all : <behavior>

The <extension_name> is described in Section A.1, "Extension Name Convention." The second form allows addressing all extensions at once. The <behavior> token can be either:

- enable — the extension is enabled if it is supported, or the error is reported if the specified extension is not supported or the token "all" is used.
- disable — the OpenCL implementation/compiler behaves as if the specified extension does not exist.
- all — only core functionality of OpenCL is used and supported; all extensions are ignored. If the specified extension is not supported, then a warning is issued by the compiler.

The order of directives in #pragma OPENCL EXTENSION is important: a later directive with the same extension name overrides any previous one.

The initial state of the compiler is set to ignore all extensions, as if it was explicitly set with the following directive:

    #pragma OPENCL EXTENSION all : disable
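For example, a kernel that relies on double precision might begin with the directive (a sketch; cl_khr_fp64 is the standard Khronos extension name):

```
#pragma OPENCL EXTENSION cl_khr_fp64 : enable

kernel void scale(global double *d, double k) {
    d[get_global_id(0)] *= k;
}
```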
        cl_char device;
        cl_char function;
    } pcie;
} cl_device_topology_amd;

The type of the structure returned can be queried by reading the first unsigned int of the returned data. The developer can use this type to cast the returned union into the right structure type. Currently, the only supported type in the structure above is PCIe (type value = 1). The information returned contains the PCI Bus/Device/Function of the device, and is similar to the result of the lspci command in Linux. It enables the developer to match between the OpenCL device ID and the physical PCI connection of the card.

cl_amd_device_board_name

This query enables the developer to get the name of the GPU board and model of the specific device. Currently, this is only for GPU devices. Calling clGetDeviceInfo with param_name set to CL_DEVICE_BOARD_NAME_AMD returns a 128-character value.

A.8.5 cl_amd_compile_options

This extension adds the following options, which are not part of the OpenCL specification.

- -g — This is an experimental feature that lets you use the GNU project debugger, GDB, to debug kernels on x86 CPUs running Linux, or cygwin/minGW under Windows. For more details, see Chapter 4, "Debugging OpenCL." This option does not affect the default optimization of the OpenCL code.
- -O0
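A host-side sketch of both queries (assumes a valid device id and the cl_ext.h extension header; error handling elided):

```
char board[128];
clGetDeviceInfo(device, CL_DEVICE_BOARD_NAME_AMD,
                sizeof(board), board, NULL);
printf("board: %s\n", board);

cl_device_topology_amd topo;
clGetDeviceInfo(device, CL_DEVICE_TOPOLOGY_AMD,
                sizeof(topo), &topo, NULL);
// Type value 1 indicates the PCIe variant of the union.
if (topo.raw.type == CL_DEVICE_TOPOLOGY_TYPE_PCIE_AMD)
    printf("PCIe %02x:%02x.%x\n",
           topo.pcie.bus, topo.pcie.device, topo.pcie.function);
```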
data is read from LDS and placed into the work-item's registers; then, it is placed into global memory. To make effective use of the LDS, an algorithm must perform many operations on what is transferred between global memory and LDS. It also is possible to load data from a memory buffer directly into LDS, bypassing VGPRs.

LDS atomics are performed in the LDS hardware. Thus, although ALUs are not directly used for these operations, latency is incurred by the LDS executing this function. If the algorithm does not require write-to-read reuse (the data is read-only), it usually is better to use the image dataflow (see right side of Figure 1.5) because of the cache hierarchy. Actually, buffer reads may use L1 and L2. When caching is not used for a buffer, reads from that buffer bypass L2. After a buffer read, the line is invalidated; then, on the next read, it is read again (from the same wavefront or from a different clause). After a buffer write, the changed parts of the cache line are written to memory.

Buffers and images are written through the texture L2 cache, but this is flushed immediately after an image write.

The data in private memory is first placed in registers. If more private memory is used than can be placed in registers, or dynamic indexing is used on private
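A minimal OpenCL C sketch of the write-to-read reuse pattern that makes LDS worthwhile (the kernel and variable names are illustrative): a work-group stages data in local memory, synchronizes, then reuses neighbours' values from LDS instead of re-reading global memory.

```
kernel void stage_in_lds(global const float *in,
                         global float *out,
                         local float *tile) {
    size_t gid = get_global_id(0);
    size_t lid = get_local_id(0);
    size_t n   = get_local_size(0);

    // Global -> LDS: each work-item stages one element.
    tile[lid] = in[gid];
    barrier(CLK_LOCAL_MEM_FENCE);

    // Reuse a neighbour's element from LDS (wrapping within the group).
    out[gid] = 0.5f * (tile[lid] + tile[(lid + 1) % n]);
}
```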
for display when the ancestor becomes mapped. Such a window is called unviewable. When all its ancestors are mapped, the window becomes viewable and is visible on the screen if it is not obscured by another window.

Use glXCreateContextAttribsARB to initialize the context to the initial state defined by the OpenGL specification, and return a handle to it. This handle can be used to render to any GLX surface.

Use glXMakeCurrent to make argument 3 (GLXContext) the current GLX rendering context of the calling thread, replacing the previously current context if there was one, and to attach argument 3 (GLXContext) to a GLX drawable (either a window or a GLX pixmap).

Use clGetGLContextInfoKHR to get the OpenCL-OpenGL interoperability device corresponding to the window created in step 5.

Use clCreateContext to create the context on the interoperable device obtained in step 9.

The following code snippet shows how to create a CL-GL interoperability context using the X Window system in Linux.

    Display *displayName = XOpenDisplay(0);
    int nelements;
    GLXFBConfig *fbc = glXChooseFBConfig(displayName,
                                         DefaultScreen(displayName),
                                         0, &nelements);
    static int attributeList[] = { GLX_RGBA,
                                   GLX_DOUBLEBUFFER,
                                   GLX_RED_SIZE,   1,
                                   GLX_GREEN_SIZE, 1,
                                   GLX_BLUE_SIZE,  1,
                                   None };
    XVisualInfo *vi = glXChooseVisual(displayName,
                                      DefaultScreen(displayName),
                                      attributeList);
    XSetWindowAtt
from a GL Context, Using GLUT

1. Use glutInit to initialize the GLUT library and to negotiate a session with the windowing system. This function also processes the command-line options, depending on the windowing system.

2. Use glXGetCurrentContext to get the current rendering context (GLXContext).

3. Use glXGetCurrentDisplay to get the display (Display *) that is associated with the current OpenGL rendering context of the calling thread.

4. Use clGetGLContextInfoKHR (see Section 9.7 of the OpenCL Specification 1.1) and the CL_CURRENT_DEVICE_FOR_GL_CONTEXT_KHR parameter to get the device ID of the CL device associated with the OpenGL context.

5. Use clCreateContext (see Section 4.3 of the OpenCL Specification 1.1) to create the CL context (of type cl_context).

The following code snippet shows how to create an interoperability context using GLUT in Linux.

    glutInit(&argc, argv);
    glutInitDisplayMode(GLUT_RGBA | GLUT_DOUBLE);
    glutInitWindowSize(WINDOW_WIDTH, WINDOW_HEIGHT);
    glutCreateWindow("OpenCL SimpleGL");
    GLXContext glCtx = glXGetCurrentContext();
    cl_context_properties cpsGL[] = {
        CL_CONTEXT_PLATFORM, (cl_context_properties)platform,
        CL_GLX_DISPLAY_KHR, (intptr_t)glXGetCurrentDisplay(),
        CL_GL_CONT
instructions at once in a very long instruction word (VLIW) packet.

Figure D.1 shows a simplified block diagram of a generalized AMD GPU compute device.

[Figure D.1: Generalized AMD GPU Compute Device Structure — a GPU compute device containing multiple compute devices, each composed of compute units with ALUs.]

Figure D.2 is a simplified diagram of an AMD GPU compute device. Different GPU compute devices have different characteristics (such as the number of compute units), but follow a similar design pattern.

[Figure D.2 components: Ultra-Threaded Dispatch Processor (UTDP); compute units; processing elements; general-purpose registers.]
...its products, including but not limited to the implied warranty of merchantability, fitness for a particular purpose, or infringement of any intellectual property right. AMD's products are not designed, intended, authorized, or warranted for use as components in systems intended for surgical implant into the body, or in other applications intended to support or sustain life, or in any other application in which the failure of AMD's product could create a situation where personal injury, death, or severe property or environmental damage may occur. AMD reserves the right to discontinue or make changes to its products at any time without notice.

Advanced Micro Devices, Inc.
One AMD Place
P.O. Box 3453
Sunnyvale, CA 94088-3453
www.amd.com

For AMD Accelerated Parallel Processing:
URL: developer.amd.com/appsdk
Developing: developer.amd.com
Forum: developer.amd.com/openclforum

Preface

About This Document

This document provides a basic description of the AMD Accelerated Parallel Processing environment and components. It describes the basic architecture of compute and stream processors. This document also provides a guide for programmers who want to use AMD Accelerated Parallel Processing to accelerate their applications.

Audience

This document is intended for programmers. It assumes prior experience in writing code for CPUs and a basic understanding of threads and work-items.

Organization
    unsigned int globalUpperIndex = globalLowerIndex + subdivSize - 1;

    if (subdivSize < global_threads)
        continue;

    subdivSize = (globalUpperIndex - globalLowerIndex + 1) / global_threads;

    ndrange_t ndrange1 = ndrange_1D(global_threads);
    void (^binarySearch_device_enqueue_wrapper_blk)(void) = ^{
        binarySearch_device_enqueue(outputArray, sortedArray, subdivSize,
                                    globalLowerIndex, keys[key_count],
                                    global_threads, key_count,
                                    parent_globalids);
    };
    int err_ret = enqueue_kernel(def_q, CLK_ENQUEUE_FLAGS_NO_WAIT, ndrange1,
                                 binarySearch_device_enqueue_wrapper_blk);
    if (err_ret != 0) {
        outputArray[key_count].w = 2;
        outputArray[key_count].z = err_ret;
        return;
    }

In the OpenCL 2.0 version, each work-item checks, for each key, whether the key is found in its search range. If the key is found, the work-item further divides the range into chunks and enqueues the kernel for further processing. The advantage is that when the input array is large, the OpenCL 2.0 version divides the input array into 1024-sized chunks. The chunk in which the given key falls is found, and another kernel is enqueued, which further divides that chunk into 1024-sized chunks, and so on. In OpenCL 1.2, because the whole array is taken as the NDRange, a huge number of work-groups require processing.
- new and delete: support is not provided.
- Exceptions: no support for throw/catch.
- The :: operator.
- The STL and other standard C++ libraries.

The language specified in this extension can easily be expanded to support these features.

5.1.3 Relations with ISO/IEC C++

This extension focuses on documenting the differences between the OpenCL Static C++ kernel language and the ISO/IEC "Programming languages: C++" specification. Where possible, this extension leaves technical definitions to the ISO/IEC specification.

5.2 Additions and Changes to Section 5, The OpenCL C Runtime

5.2.1 Additions and Changes to Section 5.7.1, Creating Kernel Objects

In the static C++ kernel language, a kernel can be overloaded, templated, or both. The syntax explaining how to do this is defined in Sections 5.3.4 and 5.3.5, below. To support these cases, the following error codes were added; these can be returned by clCreateKernel:

- CL_INVALID_KERNEL_TEMPLATE_TYPE_ARGUMENT_AMD: a kernel template argument is not a valid type (is neither a valid OpenCL C type nor a user-defined type in the same source file).
- CL_INVALID_KERNEL_TYPE_ARGUMENT_AMD: a kernel type argument, used for overloading resolution, is not a valid type (is neither a valid OpenCL C type nor a user-defined type in the same source program).
...the operation is applied to the other components of the vectors.

Built-in function:

    intn   amd_median3(intn src0, intn src1, intn src2)
    uintn  amd_median3(uintn src0, uintn src1, uintn src2)
    floatn amd_median3(floatn src0, floatn src1, floatn src2)

Description: Returns the median of src0, src1, and src2.

Built-in function:

    intn   amd_min3(intn src0, intn src1, intn src2)
    uintn  amd_min3(uintn src0, uintn src1, uintn src2)
    floatn amd_min3(floatn src0, floatn src1, floatn src2)

Description: Returns the minimum of src0, src1, and src2.

Built-in function:

    intn   amd_max3(intn src0, intn src1, intn src2)
    uintn  amd_max3(uintn src0, uintn src1, uintn src2)
    floatn amd_max3(floatn src0, floatn src1, floatn src2)

Description: Returns the maximum of src0, src1, and src2.

For more information, see http://www.khronos.org/registry/cl/extensions/amd/cl_amd_media_ops2.txt

A.8.11 cl_amd_printf

The OpenCL Specifications 1.1 and 1.2 support the optional AMD extension cl_amd_printf, which provides printf capabilities to OpenCL C programs. To use this extension, an application first must include:

    #pragma OPENCL EXTENSION cl_amd_printf : enable

The built-in function

    printf(__constant char * restrict format, ...);

writes output to the stdout stream associated with the host application.
...resources (pX and pY). OpenCL resources are cleaned up by the C++ bindings support code. The catch(cl::Error err) block handles exceptions thrown by the C++ bindings; if the error code is CL_BUILD_PROGRAM_FAILURE, it is necessary to print out the build log.

Example Code 2:

    #define __CL_ENABLE_EXCEPTIONS

    #include <CL/cl.hpp>
    #include <string>
    #include <iostream>

    using std::cout;
    using std::cerr;
    using std::endl;
    using std::string;

    /////////////////////////////////////////////////////////////////
    // Helper function to print vector elements
    /////////////////////////////////////////////////////////////////
    void printVector(const std::string arrayName,
                     const cl_float *arrayData,
                     const unsigned int length)
    {
        int numElementsToPrint = (256 < length) ? 256 : length;
        cout << endl << arrayName << ":" << endl;
        for (int i = 0; i < numElementsToPrint; ++i)
            cout << arrayData[i] << " ";
        cout << endl;
    }

    /////////////////////////////////////////////////////////////////
    // Globals
    /////////////////////////////////////////////////////////////////
    int length;
    cl_float *pX = NULL;
    cl_float *pY = NULL;
    cl_float a = 2.f;

    std::vector<cl::Platform> platforms;
    cl::Context context;
    std::vector<cl::Device> devices;
    cl::CommandQueue queue;
    cl::Program program;
    cl::Kernel kernel;
...the NDRange, a huge number of work-groups require processing. The following figure shows how the OpenCL 2.0 version compares to the OpenCL 1.2 version as the array size increases beyond a certain point.

[Figure: Performance comparison vs. OpenCL 1.2. Kernel time in milliseconds for OpenCL 1.2 and OpenCL 2.0, for 1 to 2000 keys over sample sizes from 100K to 10M.]

The above figure shows the performance benefit of using OpenCL 2.0 over the same sample using OpenCL 1.2. In OpenCL 2.0, the reduced number of kernel launches from the host allows superior performance. Kernel enqueues are much more efficient when done from the device.

6.5 Atomics and synchronization

6.5.1 Overview

In OpenCL 1.2, only work-items in the same work-group can synchronize. OpenCL 2.0 introduces a new and detailed memory model, which allows developers to reason about the effects of their code on memory, and in particular to understand whether atomic operations and fences used for synchronization ensure the visibility of variables being used to communicate between threads. In conjunction with the new memory model, OpenCL 2.0 adds a new set of atomic built-in functions and fences derived from C++11 (although the set of types is restricted), and also deprecates the 1.2 atomic built-in functions and fences.

6.5.2 Usage
...the device handle (HWND).

1. Use GetDC to get a handle to the device context for the client area of a specific window, or for the entire screen; or use the CreateDC function to create a device context (HDC) for the specified device.
2. Use ChoosePixelFormat to match an appropriate pixel format supported by a device context to a given pixel format specification.
3. Use SetPixelFormat to set the pixel format of the specified device context to the specified format.
4. Use wglCreateContext to create a new OpenGL rendering context from the device context (HDC).
5. Use wglMakeCurrent to bind the GL context created in the above step as the current rendering context.
6. Use the clGetGLContextInfoKHR function (see Section 9.7 of the OpenCL Specification 1.1) with the parameter CL_CURRENT_DEVICE_FOR_GL_CONTEXT_KHR to get the device ID of the CL device associated with the OpenGL context.
7. Use the clCreateContext function (see Section 4.3 of the OpenCL Specification 1.1) to create the CL context (of type cl_context).

The following code snippet shows how to create an interoperability context using the WIN32 API for windowing. Users also can refer to the SimpleGL sample in the AMD APP SDK samples.

    int pfmt;
    PIXELFORMATDESCRIPTOR pfd;
...the work-group. After log(n) stages, the prefix sums for the complete out array are computed. This gives the index into the final primes array. During every stage, work_group_broadcast is used to broadcast the prefix sum of the last element of the previous work-group. The following shows the contents of the various arrays at each stage, assuming a work-group size of 4:

    Input:                      12 31 47 64 19 27 49 81 99 11
    After setPrimes:   primes =  0  1  1  0  1  0  0  0  0  1
    After group_scan_kernel:
                       out    =  0  1  2  2 | 1  1  1  1 | 0  1
    After global_scan_kernel:
                       out    =  0  1  2  2  3  3  3  3  3  4

    outPrimes: 31 47 19 11

Finally, the device-enqueue feature is employed: the device enqueues these kernels in sequence, instead of the host enqueuing them. This improves performance. Because these kernels must be run in sequence, they cannot be combined into one kernel.

The host launches the setPrimes, group_scan, and global_scan kernels in sequence to set the primes Boolean array, perform the prefix-sum calculation at the work-group level, and perform the prefix sum across work-groups, respectively. The output after each stage is shown above. The performance of this OpenCL 2.0 version is compared with that of the OpenCL 1.2 version.
Contents

1.3 Programming Model
1.5 Memory Architecture and Access
    1.5.1 Data Share Operations
    1.5.2 Dataflow in Memory Hierarchy
    1.5.3 Memory Access
    1.5.4 Global Memory
    1.5.5 Image Read/Write
1.6 Example Programs
    1.6.1 First Example: Simple Buffer Write
    1.6.2 Example: SAXPY Function
    1.6.3 Example: Parallel Min() Function

Chapter 2 AMD Implementation
2.1 The AMD Accelerated Parallel Processing Implementation of OpenCL
    2.1.1 Work-Item Processing
    2.1.2 Work-Item Creation
    2.1.3 Flow Control
2.2 Hardware Overview for GCN Devices
    2.2.1 Key differences between pre-GCN and GCN devices
    2.2.2 Key differences between Southern Islands, Sea Islands, and Volcanic Islands families
2.3 Communication Between Host and the GPU Compute Device
    2.3.1 Processing API Calls: The Command Processor
    2.3.2 DMA Transfers
    2.3.3 Masking Visible Devices
2.4 Wavefront Scheduling
Many operations are performed with respect to a given context; there also are many operations that are specific to a device. For example, program compilation and kernel execution are done on a per-device basis. Performing work with a device, such as executing kernels or moving data to and from the device's local memory, is done using a corresponding command queue. A command queue is associated with a single device and a given context; all work for a specific device is done through this interface. Note that while a single command queue can be associated with only a single device, there is no limit to the number of command queues that can point to the same device. For example, it is possible to have one command queue for executing kernels and a command queue for managing data transfers between the host and the device.

Most OpenCL programs follow the same pattern. Given a specific platform, select a device or devices to create a context, allocate memory, create device-specific command queues, and perform data transfers and computations. Generally, the platform is the gateway to accessing specific devices; given these devices and a corresponding context, the application is independent of the platform. Given a context, the application can:

- Create one or more command queues.
- Create programs to run on one or more associated devices.
[Figure: Performance of prime extraction, OpenCL 1.2 vs. OpenCL 2.0. Execution time in seconds for input sizes from 10,000 to 640,000 elements.]

As the size of the input array exceeds 4K, the difference in performance between the OpenCL 1.2 implementation and the OpenCL 2.0 implementation becomes more apparent. In OpenCL 1.2, the time taken is O(n) and the work done is O(n^2); in OpenCL 2.0, the time taken is O(log n) and the work done is O(n), assuming the prefix sum at the group level is constant time.

6.4.2.4 Binary search using device enqueue (kernel enqueue)

The power of device enqueue is aptly illustrated by the example of binary search. To make the problem interesting, multiple keys in a sorted array will be searched for. The versions written for OpenCL 1.2 and 2.0 will also be compared with respect to programmability and performance. The OpenCL 1.2 version of the code that performs binary search is as follows:

    __kernel void binarySearch_mulkeys(__global int *keys,
                                       __global uint *input,
                                       const unsigned int numKeys,
                                       __global int *output)
    {
        int gid = get_global_id(0);
        int lBound = gid * 256;
        int uBound = lBound + 255;

        for (int i = 0; i < numKeys; i++) {
            if (keys[i] >= input[lBound] && keys[i] <= input[uBound])
                output[i] = lBound;
        }
    }

The search for multiple keys is done sequentially, while the sorted array is divided into 256-sized chunks.
A.8.4 cl_amd_device_attribute_query

This extension provides a means to query AMD-specific device attributes. To enable this extension, include the

    #pragma OPENCL EXTENSION cl_amd_device_attribute_query : enable

directive. Once the extension is enabled, and the clGetDeviceInfo parameter <param_name> is set to CL_DEVICE_PROFILING_TIMER_OFFSET_AMD, the offset in nanoseconds between an event timestamp and Epoch is returned.

A.8.4.1 cl_device_profiling_timer_offset_amd

This query enables the developer to get the offset between event timestamps in nanoseconds. To use it, compile the kernels with the

    #pragma OPENCL EXTENSION cl_amd_device_attribute_query : enable

directive. For kernels compiled with this pragma, calling clGetDeviceInfo with param_name set to CL_DEVICE_PROFILING_TIMER_OFFSET_AMD returns the offset in nanoseconds between event timestamps.

A.8.4.2 cl_amd_device_topology

This query enables the developer to get a description of the topology used to connect the device to the host. Currently, this query works only in Linux. Calling clGetDeviceInfo with param_name set to CL_DEVICE_TOPOLOGY_AMD returns the following union of structures:

    typedef union
    {
        struct { cl_uint type; cl_uint data[5]; } raw;
        struct { cl_uint type; cl_char unused[17];
                 cl_char bus; cl_char device; cl_char function; } pcie;
    } cl_device_topology_amd;
AMD

AMD Accelerated Parallel Processing
OpenCL User Guide

October 2014

(c) 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, AMD Accelerated Parallel Processing, the AMD Accelerated Parallel Processing logo, ATI, the ATI logo, Radeon, FireStream, FirePro, Catalyst, and combinations thereof are trademarks of Advanced Micro Devices, Inc. Microsoft, Visual Studio, Windows, and Windows Vista are registered trademarks of Microsoft Corporation in the U.S. and/or other jurisdictions. Other names are for informational purposes only and may be trademarks of their respective owners. OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos.

The contents of this document are provided in connection with Advanced Micro Devices, Inc. ("AMD") products. AMD makes no representations or warranties with respect to the accuracy or completeness of the contents of this publication and reserves the right to make changes to specifications and product descriptions at any time without notice. The information contained herein may be of a preliminary or advance nature and is subject to change without notice. No license, whether express, implied, arising by estoppel or otherwise, to any intellectual property rights is granted by this publication. Except as set forth in AMD's Standard Terms and Conditions of Sale, AMD assumes no liability whatsoever, and disclaims any express or implied warranty relating to its products.
        sampler_t imageSampler,
        __global float *xOffsets,
        __global float *yOffsets,
        __global float *results)
    {
        int tidX = get_global_id(0), tidY = get_global_id(1);
        int offset = tidY * get_image_width(input) + tidX;
        int2 coords = (int2)(xOffsets[offset], yOffsets[offset]);
        results[offset] = read_imagef(input, imageSampler, coords);
    }

The AMD OpenCL 2.0 platform fully supports the cl_khr_depth_images extension, but not the cl_khr_gl_depth_images extension. Consequently, the AMD OpenCL platform does not support creating a CL depth image from a GL depth or depth-stencil texture.

6.10 Non-uniform work-group size

6.10.1 Overview

Prior to OpenCL 2.0, each work-group size needed to divide evenly into the corresponding global size. This requirement is relaxed in OpenCL 2.0: the final work-group in each dimension is allowed to be smaller than all of the other work-groups in the "uniform part" of the NDRange. This can reduce the effort required to map problems onto NDRanges.

A consequence is that kernels may no longer assume that calls to get_local_size return the same value in all work-groups. However, a new call, get_enqueued_local_size, has been added to obtain the size in the uniform part, which is specified using the local_work_size argument to clEnqueueNDRangeKernel. A new compile-time option, -cl-uniform-work-group-size, has been added to optimize the computation for cases in which the work-group size evenly divides the global size.
    //
    // Copyright (c) 2010 Advanced Micro Devices, Inc. All rights reserved.
    //
    // A minimalist OpenCL program.

    #include <CL/cl.h>
    #include <stdio.h>

    #define NWITEMS 512

    // A simple memset kernel
    const char *source =
    "__kernel void memset( __global uint *dst )             \n"
    "{                                                      \n"
    "    dst[get_global_id(0)] = get_global_id(0);          \n"
    "}                                                      \n";

    int main(int argc, char **argv)
    {
        // 1. Get a platform.
        cl_platform_id platform;
        clGetPlatformIDs(1, &platform, NULL);

        // 2. Find a GPU device.
        cl_device_id device;
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
...CL_KERNEL_EXEC_INFO_SVM_PTRS
CL_KERNEL_EXEC_INFO_SVM_FINE_GRAIN_SYSTEM
CL_COMMAND_SVM_FREE
CL_COMMAND_SVM_MEMCPY
CL_COMMAND_SVM_MEMFILL
CL_COMMAND_SVM_MAP
CL_COMMAND_SVM_UNMAP
CL_PROFILING_COMMAND_COMPLETE

F.3.3 New API calls

clCreateCommandQueueWithProperties (see section 5.1)
clCreatePipe (see section 5.4.1)
clGetPipeInfo (see section 5.4.2)
clSVMAlloc (see section 5.6.1)
clSVMFree (see section 5.6.1)
clEnqueueSVMFree (see section 5.6.1)
clEnqueueSVMMemcpy (see section 5.6.1)
clEnqueueSVMMemFill (see section 5.6.1)
clEnqueueSVMMap (see section 5.6.1)
clEnqueueSVMUnmap (see section 5.6.1)
clCreateSamplerWithProperties (see section 5.7.1)
clSetKernelArgSVMPointer (see section 5.9.2)
clSetKernelExecInfo (see section 5.9.2)
clGetKernelSubGroupInfoKHR (see section 9.17.2.1)

F.4 Deprecated runtimes

clCreateCommandQueue
clCreateSampler
clEnqueueTask
...the matrix B is computed; then row i is computed. The row containing all 1s corresponds to the maximum, C[i].

    __kernel void atomicMax(volatile __global int *A,
                            __global int *B,
                            __global int *C,
                            __global int *P)
    {
        int i = get_global_id(0);
        int j = get_global_id(1);
        int N = *P, k;
        if (A[i] >= A[j])
            B[i * N + j] = 1;
        else
            B[i * N + j] = 0;
        for (k = 0; k < N; k++)
            atomic_fetch_and_explicit((__global atomic_int *)&C[i],
                                      B[i * N + k],
                                      memory_order_release,
                                      memory_scope_device);
    }

Similarly, another sample includes the following kernel, which increments a counter 2N times: N times in the kernel and another N times on the host.

    __kernel void counter(__global int *count)
    {
        atomic_fetch_add((__global atomic_int *)count, 1);
        /* (*count)++; */
    }

Note: If atomic_fetch_add is not used and, instead, a plain increment (as in the commented line) is used, the sum will not be computed correctly.

6.6 Pipes

6.6.1 Overview

OpenCL 2.0 introduces a new mechanism, pipes, for passing data between kernels. A pipe is essentially a structured buffer containing space for some number of packets of a single, kernel-specified type, plus bookkeeping information. Pipes are accessed via the special read_pipe and write_pipe built-in functions. A given kernel may either read from or write to a pipe, but not both. Pipes are only coherent at the standard synchronization points.
        CL_GL_CONTEXT_KHR, (intptr_t)glCtx,
        0
    };

    status = clGetGLContextInfoKHR(cpsGL,
                                   CL_CURRENT_DEVICE_FOR_GL_CONTEXT_KHR,
                                   sizeof(cl_device_id),
                                   &interopDevice,
                                   NULL);

    // Create OpenCL context from device's id
    context = clCreateContext(cpsGL, 1, &interopDevice, 0, 0, &status);

Using the X Window System

1. Use XOpenDisplay to open a connection to the server that controls a display.
2. Use glXChooseFBConfig to get a list of GLX frame buffer configurations that match the specified attributes.
3. Use glXChooseVisual to get a visual that matches specified attributes.
4. Use XCreateColormap to create a color map of the specified visual type for the screen on which the specified window resides; it returns the colormap ID associated with it. Note that the specified window is only used to determine the screen.
5. Use XCreateWindow to create an unmapped sub-window for a specified parent window; it returns the window ID of the created window and causes the X server to generate a CreateNotify event. The created window is placed on top in the stacking order with respect to siblings.
6. Use XMapWindow to map the window and all of its sub-windows that have had map requests. Mapping a window that has an unmapped ancestor does not display the window, but marks it as eligible for display when the ancestors become mapped.
        clSetKernelArg(minp, 0, sizeof(void *),  (void *)&src_buf);
        clSetKernelArg(minp, 1, sizeof(void *),  (void *)&dst_buf);
        clSetKernelArg(minp, 2, sizeof(cl_uint), (void *)NULL);
        clSetKernelArg(minp, 3, sizeof(void *),  (void *)&dbg_buf);
        clSetKernelArg(minp, 4, sizeof(void *),  (void *)&num_src_items);
        clSetKernelArg(minp, 5, sizeof(void *),  (void *)&dev);
        clSetKernelArg(reduce, 0, sizeof(void *), (void *)&src_buf);
        clSetKernelArg(reduce, 1, sizeof(void *), (void *)&dst_buf);

        clEnqueueNDRangeKernel(queue, minp, 1, NULL, &global_work_size,
                               &local_work_size, 0, NULL, &ev);
        clEnqueueNDRangeKernel(queue, reduce, 1, NULL, &num_groups, NULL,
                               0, NULL, &ev);

    t = GetElapsedTime();
    printf("B/W %.2f GB/sec\n",
           (float)num_src_items * sizeof(cl_uint) * NLOOPS / t / 1e9);

    // 7. Look at the results via synchronous buffer map.
    dst_ptr = (cl_uint *)clEnqueueMapBuffer(queue, dst_buf, CL_TRUE,
                                            CL_MAP_READ, 0,
                                            num_groups * sizeof(cl_uint),
                                            0, NULL, NULL, NULL);
    dbg_ptr = (cl_uint *)clEnqueueMapBuffer(queue, dbg_buf, CL_TRUE,
                                            CL_MAP_READ, 0,
                                            global_work_size * sizeof(cl_uint),
                                            0, NULL, NULL, NULL);

    // 8. Print some debug info.
    printf("%d groups, %5d threads, count %3d, stride %d\n",
           dbg_ptr[0], dbg_ptr[1], dbg_ptr[2], dbg_ptr[3]);

    if (dst_ptr[0] == min)
        printf("result correct\n");
    else
        printf("result INcorrect\n");
    printf("\n");
    return 0;
    }

Chapter 2 AMD Implementation

2.1 The AMD Accelerated Parallel Processing Implementation of OpenCL

AMD Accelerated Parallel Processing harnesses the tremendous processing power of GPUs for high-performance, data-parallel computing.
Appendix D Hardware overview of pre-GCN devices

This chapter provides a hardware overview of pre-GCN devices. Pre-GCN devices include the Evergreen and Northern Islands families that are based on VLIW.

A general OpenCL device comprises compute units, each of which can have multiple processing elements. A work-item (or SPMD kernel instance) executes on a single processing element. The processing elements within a compute unit can execute in lock-step using SIMD execution. Compute units, however, execute independently (see Figure D.1).

AMD GPUs consist of multiple compute units. The number of them, and the way they are structured, varies with the device family, as well as device designations within a family. Each of these processing elements possesses ALUs. For devices in the Northern Islands and Southern Islands families, these ALUs are arranged in four processing elements (in the Evergreen family, five) with arrays of 16 ALUs. Each of these arrays executes a single instruction across each lane for each of a block of 16 work-items. That instruction is repeated over four cycles to make the 64-element vector called a wavefront. On Northern Islands and Evergreen family devices, the PE arrays execute instructions from one wavefront, so that each work-item issues four (for Northern Islands) or five (for Evergreen) instructions at once, in a very long instruction word (VLIW) packet.
81. AMD ACCELERATED PARALLEL PROCESSING

5.2.2 Passing Classes between Host and Device

This extension allows a developer to pass classes between the host and the device. The mechanism used to pass the class to the device and back is the existing buffer object API. The class that is passed maintains its state (public and private members), and the compiler implicitly changes the class to use either the host-side or device-side methods.

On the host side, the application creates the class and an equivalent memory object with the same size (using the sizeof function). It then can use the class methods to set or change values of the class members. When the class is ready, the application uses a standard buffer API to move the class to the device (either Unmap or Write), then sets the buffer object as the appropriate kernel argument and enqueues the kernel for execution. When the kernel finishes the execution, the application can map back (or read) the buffer object into the class and continue working on it.

5.3 Additions and Changes to Section 6 - The OpenCL C Programming Language

5.3.1 Building C++ Kernels

To compile a program that contains C++ kernels and functions, the application must add the following compile option to clBuildProgramWithSource: -x language, where language is defined as one of the following:

• clc - the source language is considered to be OpenCL C, as defined in The OpenCL Programming Language, version 1.2.
• clc++ - the
82. Machine LLVM See LLVM M macros GPU specific 1 predefined CPUS a a tot 13 GPU kaaga iaia ei aia a aina Da aa 13 OpenCL C kernels 13 mapping executions onto compute units 2 OpenCL mah ora anata char ad gawd 3 work items onto n dimensional grid ND Range aaah aa LEG be noe Nama Bes 3 work items to stream cores 2 masking GPUs 10 mem fence operation 4 memories interrelationship of 5 memory ACCESS sue ass EE ds ne dye elon 5 9 allocation select a device 4 architecture 5 bandwidth 7 double copying 7 commands 6 domains interrelationship 6 Index 9 Copyright 2013 Advanced Micro Devices Inc All rights reserved AMD ACCELERATED PARALLEL PROCESSING fence barriers 3 operations 3 global VRAM 9 hiding latency 10 11 loads ose x awed ERU RERO 9 object allocation OpenCL context 3 OpenCL domains 5 read operations 10 request ses es autos Juana hia e 9 SONGS csi au RERO RAPERE 9 streaming 9 system pinned 7 transfer management FA write operation
83. NULL, &numPlatforms);
if (!sampleCommon->checkVal(status, CL_SUCCESS, "clGetPlatformIDs failed."))
{
    return SDK_FAILURE;
}

if (0 < numPlatforms)
{
    cl_platform_id* platforms = new cl_platform_id[numPlatforms];
    status = clGetPlatformIDs(numPlatforms, platforms, NULL);
    if (!sampleCommon->checkVal(status, CL_SUCCESS, "clGetPlatformIDs failed."))
    {
        return SDK_FAILURE;
    }
    for (unsigned i = 0; i < numPlatforms; ++i)
    {
        char pbuf[100];
        status = clGetPlatformInfo(platforms[i],
                                   CL_PLATFORM_VENDOR,
                                   sizeof(pbuf),
                                   pbuf,
                                   NULL);
        if (!sampleCommon->checkVal(status, CL_SUCCESS, "clGetPlatformInfo failed."))
        {
            return SDK_FAILURE;
        }
        platform = platforms[i];
        if (!strcmp(pbuf, "Advanced Micro Devices, Inc."))
        {
            break;
        }
    }
    delete[] platforms;
}

/*
 * If we could find our platform, use it. Otherwise, pass a NULL and
 * get whatever the implementation thinks we should be using.
 */
cl_context_properties cps[3] =
{
    CL_CONTEXT_PLATFORM,
    (cl_context_properties)platform,
    0
};
/* Use NULL for backward compatibility */
cl_context_properties* cprops = (NULL == platform) ? NULL : cps;

context = clCreateContextFromType(cprops, dType, NULL, NULL, &status);
84. [New Thread 0x7ffff7e6b700 (LWP 1894)]
[New Thread 0x7ffff2fcc700 (LWP 1895)]
Executing kernel for 1 iterations
-------------------------------------------
Breakpoint 1, 0x00007ffff77b9b20 in clEnqueueNDRangeKernel () from /home/himanshu/Desktop/ati-stream-sdk-v2.3-lnx64/lib/x86_64/libOpenCL.so
(gdb) info functions __OpenCL
All functions matching regular expression "__OpenCL":

File OCLm2oVFr.cl:
void __OpenCL_bitonicSort_kernel(uint *, const uint, const uint, const uint, const uint);

Non-debugging symbols:
0x00007ffff23c2dc0  __OpenCL_bitonicSort_kernel@plt
0x00007ffff23c2f40  __OpenCL_bitonicSort_stub
(gdb) b __OpenCL_bitonicSort_kernel
Breakpoint 2 at 0x7ffff23c2de9: file OCLm2oVFr.cl, line 32.
(gdb) c
Continuing.
[Switching to Thread 0x7ffff2fcc700 (LWP 1895)]

Breakpoint 2, __OpenCL_bitonicSort_kernel (theArray=0x615ba0, stage=0, passOfStage=0, width=1024, direction=0) at OCLm2oVFr.cl:32
32        uint sortIncreasing = direction;
(gdb) p get_global_id(0)
$1 = 0
(gdb) c
Continuing.

Breakpoint 2, __OpenCL_bitonicSort_kernel (theArray=0x615ba0, stage=0, passOfStage=0, width=1024, direction=0) at OCLm2oVFr.cl:32
32        uint sortIncreasing = direction;
(gdb) p get_global_id(0)
$2 = 1
(gdb)

4.2.4 Notes

1. To make a breakpoint in a working thread with some particular ID in dimension N, one techniqu
85. …X = cl::Buffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                sizeof(cl_float) * length, pX);
bufY = cl::Buffer(context, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                sizeof(cl_float) * length, pY);

/////////////////////////////////////////////////////////////////
// Load CL file, build CL program object, create CL kernel object
/////////////////////////////////////////////////////////////////
cl::Program::Sources sources(1, std::make_pair(kernelStr.c_str(),
                kernelStr.length()));
program = cl::Program(context, sources);
program.build(devices);
kernel = cl::Kernel(program, "saxpy");

/////////////////////////////////////////////////////////////////
// Set the arguments that will be used for kernel execution
/////////////////////////////////////////////////////////////////
kernel.setArg(0, bufX);
kernel.setArg(1, bufY);
kernel.setArg(2, a);

/////////////////////////////////////////////////////////////////
// Enqueue the kernel to the queue
// with appropriate global and local work sizes
/////////////////////////////////////////////////////////////////
queue.enqueueNDRangeKernel(kernel, cl::NDRange(),
                cl::NDRange(length), cl::NDRange(64));

/////////////////////////////////////////////////////////////////
// Enqueue blocking call to read back buffer Y
/////////////////////////////////////////////////////////////////
queue.enqueueR
86. a a ERES 1 for a specific device 2 for available platforms 2 the platform API 1 the runtime OpenCL building 3 queue command 4 9 R R9 290X series devices 8 random memory location GPU storage of writes 9 random access functionality NDRange 1 read imaging 10 read only buffers CON SANS i a xen ru Ed dens 10 running code on Linux 7 on Windows 6 runtime change CL options 5 code parallel min function 19 Copyright 2013 Advanced Micro Devices Inc All rights reserved AMD ACCELERATED PARALLEL PROCESSING interface OpenCL 2 OpenGL ies Dana LABAN pee ax 1 changing options 6 system functions 5 S sample code relationship between b ffer s o ana coed Renee sun 4 command queue s 4 context s 4 4 4 device S AA ana adidas 4 kernel s 44444444 uen aus 4 relationship between context s 4 save temps compiler option 4 SAXPY function code sample 16 SC cache 7 scalar instructions 7 scalar unit data cache SC cache sesso erred aa 7 scalra instruction
87. a nananana natns rnt inta F 1 F 1 4 Address space quailfier functions F 2 ag APA AE F 2 F 1 6 Image Read and Write Functions eene nentes F 2 F17 Workgroup TUhClione ssussunsnnci tent and int du t ote dn d teRERE M ROREM CURE RAM ERR REB F 2 F 1 8 FIDE THOUGH Sete sitit ba meno SARE REP iti ail vut F 3 F 1 9 dicii me F 3 AG UE F 4 COMER 90 a a F 4 ES Newruntime APIs in OpenCL 2 0 mamamana nent ent ings F 6 F 3 1 a E AT F 6 F 3 2 Now Mattos sisina a KIN D RH HR HER LR UL RE bui nean F 6 F 3 3 New API alls iced rie need unes F 8 GO OMEN C orig Mem F 8 Index Contents Xi Copyright 2013 Advanced Micro Devices Inc All rights reserved AMD ACCELERATED PARALLEL PROCESSING Xii Contents Copyright 2013 Advanced Micro Devices Inc All rights reserved AMD ACCELERATED PARALLEL PROCESSING Figures 11 OpenCL Programming Mol anan reta tenth baa NAAAGNAS 1 3 1 2 Interrelationship of Memory Domains for Southern Islands Devices 1 6 d Datalow bebes Host and GPU AGANE AR BRA AA 1 6 14 High Level Memory TENU aap KAGABI 1 7 13 Memory Hierarchy TNO ea aaa SA AA epu AGANG ep et o kia DA 1 8 2 1 AMD Accelerated Parallel Processing Software Ecosystem 1 1 aman nannnawaaananananwanananan 2 1 2 2 Simplified Mapping of OpenCL onto AMD Accelerated Parallel Processing 2 2 2 3 Work Item Grouping Into Work Groups
88. …al Extensions

This means that the extensions must be explicitly enabled to be used in kernel programs. Each extension that affects kernel code compilation must add a defined macro with the name of the extension. This allows the kernel code to be compiled differently, depending on whether the extension is supported and enabled, or not. For example, for extension cl_khr_fp64, there should be a #define directive for macro cl_khr_fp64, so that the following code can be preprocessed:

#ifdef cl_khr_fp64
// some code
#else
// some code
#endif

A.5 Getting Extension Function Pointers

Use the following function to get an extension function pointer.

void* clGetExtensionFunctionAddress(const char* FunctionName);

This returns the address of the extension function specified by the FunctionName string. The returned value must be appropriately cast to a function pointer type, specified in the extension spec and header file.

A return value of NULL means that the specified function does not exist in the CL implementation. A non-NULL return value does not guarantee that the extension function actually exists - queries described in sec. 2 or 3 must be done to ensure the extension is supported.

The clGetExtensionFunctionAddress function cannot be used to get core API function addresses.

A.6 List of Supported Extensions that are Khrono
89. …alized. Normalized coordinates are between 0.0 and 1.0 (inclusive). For the fetch units to handle 2D addresses and normalized coordinates, pre-allocated memory segments must be bound to the fetch unit, so that the correct memory address can be computed. For a single kernel invocation, up to 128 images can be bound at once for reading, and eight for writing. The maximum number of addresses is 8192x8192 for Evergreen- and Northern Islands-based devices; 16384x16384 for SI-based products.

Image reads are cached through the texture system (corresponding to the L2 and L1 caches).

1.6 Example Programs

The following subsections provide simple programming examples with explanatory comments.

1.6.1 First Example: Simple Buffer Write

This sample shows a minimalist OpenCL C program that sets a given buffer to some value. It illustrates the basic programming steps with a minimum amount of code. This sample contains no error checks, and the code is not generalized. Yet, many simple test programs might look very similar. The entire code for this sample is provided at the end of this section.

1. The host program must select a platform, which is an abstraction for a given OpenCL implementation. Implementations by multiple vendors can coexist on a host, and the sample uses the first one available.

2. A device id for a GPU device is requested. A CPU device could be requested by using CL_DEVICE_TYPE_CPU instead. The device can be a physical device, suc
90. …ample for beginner-level OpenCL programmers using C++ bindings. The sample implements the SAXPY function (Y = aX + Y), where X and Y are vectors, and a is a scalar. The full code is reproduced at the end of this section. It uses C++ bindings for OpenCL. These bindings are available in the CL/cl.hpp file in the SDK available for AMD Accelerated Parallel Processing; they also are downloadable from the Khronos website: http://www.khronos.org/registry/cl

The following steps guide you through this example.

1. Enable error checking through the exception handling mechanism in the C++ bindings by using the following define:

#define __CL_ENABLE_EXCEPTIONS

This removes the need to error check after each OpenCL call. If there is an error, the C++ bindings code throws an exception that is caught at the end of the try block, where we can clean up the host memory allocations. In this example, the C++ objects representing OpenCL resources (cl::Context, cl::CommandQueue, etc.) are declared as automatic variables, so they do not need to be released. If an OpenCL call returns an error, the error code is defined in the CL/cl.h file.

2. The kernel is very simple: each work-item, i, does the SAXPY calculation for its corresponding elements Y[i] = aX[i] + Y[i]. Bo
91. and Wavefronts cccceseeeeeeeeceeceeeceeeeeeeeeeeteeeeees 2 3 2 4 Generalized AMD GPU Compute Device Structure for GCN Devices 2 6 2 5 AMD Radeon HD 79XX Device Partial Block Diagram e ccceesereceeeeeeeneeeeeeeeenenees 2 7 2 6 Simplified Execution Of Wavefront On A Single Compute Unit 2 11 2 7 Compute Unit Stall Due to Data Dependency 2 ccececcceceecec cee eeeeeeeeeeeeeeteeeeeeeeeeeeteneeees 2 12 3 1 OpenCL Compiler TOCAM Aiia haasi AKA KANA ANA AA 3 1 X Runime Frocessing PIE none 3 6 A 1 Peer to Peer Transfers Using the c1 amd bus addressable memory Extension A 14 D 1 Generalized AMD GPU Compute Device Structure D 2 D 2 Simplified Block Diagram of an Evergreen Family GPU ssssen D 3 Contents xiii Copyright 2013 Advanced Micro Devices Inc All rights reserved AMD ACCELERATED PARALLEL PROCESSING xiv Contents Copyright 2013 Advanced Micro Devices Inc All rights reserved AMD ACCELERATED PARALLEL PROCESSING Tables AT Extension Support tor AMD GPU DIGEST 2a kasa anuman naa A 15 A 2 Extension Support for Older AMD GPUs and CPUS A 16 ag ERP leader Fite 4 m oS TOES mE C 2 EA AMO Supported GL FM i pop EI D ertt Ear dE Ebr PERPE daa ER deut E 14 Contents XV Copyright 2013 Advanced Micro Devices Inc All rights reserved AMD ACCELERATED PARALLEL PROCESSING xv
92. …and using it. The device invokes the adequate device-specific methods, and accesses the class members passed from the host.

OpenCL C kernels (defined with __kernel) may not be applied to a class constructor, destructor, or method, except in the case that the class method is defined static and, thus, does not require object construction to be invoked.

5.3.3 Namespaces

Namespaces are supported without change, as per [1].

5.3.4 Overloading

As defined in the C++ language specification, when two or more different declarations are specified for a single name in the same scope, that name is said to be overloaded. By extension, two declarations in the same scope that declare the same name but with different types are called overloaded declarations. Only kernel and function declarations can be overloaded, not object and type declarations.

As per the C++ language specification, there are a number of restrictions as to how functions can be overloaded; these are defined formally in Section 13 of the C++ language specification. Note that kernels and functions cannot be overloaded by return type.

Also, the rules for well-formed programs as defined by Section 13 of the C++ language specification are lifted to apply to both kernel and function declarations.

The overloading resolution is per Section 13.1 of the C++ language specification, but extended to account for vector types. The algorithm for "best viable function", Section 13.3.3 of the C++ language spe
93. …aneously by the host and the OpenCL devices. To access locations within a buffer (or regions within an image), the appropriate offsets must be passed to and from the OpenCL devices. In OpenCL 2.0, the host and OpenCL devices may share the same virtual address space. Buffers need not be copied over between devices. When the host and the OpenCL devices share the address space, communication between the host and the devices can occur via shared memory. This simplifies programming in heterogeneous contexts.

Support for SVM does not imply or require that the host and the OpenCL devices in an OpenCL 2.0 compliant architecture share actual physical memory. The OpenCL runtime manages the transfer of data between the host and the OpenCL devices; the process is transparent to the programmer, who sees a unified address space.

A caveat, however, concerns situations in which the host and the OpenCL devices access the same region of memory at the same time. It would be highly inefficient for the host and the OpenCL devices to have a consistent view of the memory for each load/store from any device/host. In general, the memory model of the language or architecture implementation determines how or when a memory location written by one thread or agent is visible to another. The memory model also determines to what extent the programmer can control the scope of such accesses.

OpenCL 2.0 adopts the memory model defined in C++11, with some extensions. The memory orders
94. anges to Section 5 7 1 Creating Kernel Objects 5 2 5 2 2 Passing Classes between Host and Device ss 5 3 Additions and Changes to Section 6 The OpenCL C Programming Language 5 3 5 3 1 Building C Komelon 5 3 5 3 2 Classes and Derived DIES anuman anna ma 5 3 5 3 3 INGEN SS PACS kA a ina ns a 5 4 BSA OW LC 5 4 TT ee Lu E 5 5 5 3 6 cod M 5 6 5 3 7 il 5 6 5 8 Dynamic Op rations maana maana cones cocus nied occu cin center 5 6 rp cfe 5 6 5 4 1 Passing a Class from the Host to the Device and Back 5 6 54 2 Kernel Overloadingiueee orien ctae ucc cria scc na cuc cuta ucc at 5 7 5 4 3 cu Br M 5 8 Contents Copyright 2013 Advanced Micro Devices Inc All rights reserved Chapter 6 AMD ACCELERATED PARALLEL PROCESSING OpenCL 2 0 61 La Pi AA 6 1 6 2 Shared Virtual Memory SVM siminmi 6 1 6 2 1 iil e n 6 1 Lp EET 6 3 Coarse grained MOMONY unes m sttekest acc tnt hera rst tuc L PRO LEERE IRE Re EXE RE cac 6 4 D3 Generi Address SFM PL E 6 5 6 3 1 nA A nine 6 5 C AA 6 6 STNG IG AMP 6 6 AMD APP SDK example sessions 6 6 EN n 6 7 6 4 1 iu
95. …ata structures are especially useful in heterogeneous programming scenarios. A typical scenario is as follows:

1. Host creates SVM buffer(s) with clSVMAlloc.
2. Host maps the SVM buffer(s) with blocking clEnqueueSVMMap.
3. Host fills/updates the SVM buffer(s) with data structures, including pointers.
4. Host unmaps the SVM buffer(s) by using clEnqueueSVMUnmap.
5. Host enqueues processing kernels, passing SVM buffers to the kernels with calls to clSetKernelArgSVMPointer and/or clSetKernelExecInfo.
6. The OpenCL 2.0 device processes the structures in SVM buffer(s), including following/updating pointers.
7. Repeat Step 2 as necessary.

Note that the map and unmap operations in Steps 2 and 4 may be eliminated if the SVM buffers are created by using the CL_MEM_SVM_FINE_GRAIN_BUFFER flag, which may not be supported on all devices.

6.2.2.1 Coarse-grained memory

Some applications do not require fine-grained atomics to ensure that the SVM is consistent across devices after each read/write access. After the initial map/creation of the buffer, the GPU or any other devices typically read from memory. Even if the GPU or other devices write to memory, they may not require a consistent view of the memory. For example, while searching in parallel on a binary search tree, as in the Binary Sear
96. …ations to use multiple GPUs to run the compute task.

In some cases, the user might want to mask the visibility of the GPUs seen by the OpenCL application. One example is to dedicate one GPU for regular graphics operations and the other three (in a four-GPU system) for Compute. To do that, set the GPU_DEVICE_ORDINAL environment parameter, which is a comma-separated list variable:

• Under Windows: set GPU_DEVICE_ORDINAL=1,2,3
• Under Linux: export GPU_DEVICE_ORDINAL=1,2,3

Another example is a system with eight GPUs, where two distinct OpenCL applications are running at the same time. The administrator might want to set GPU_DEVICE_ORDINAL to 0,1,2,3 for the first application, and 4,5,6,7 for the second application, thus partitioning the available GPUs so that both applications can run at the same time.

2.4 Wavefront Scheduling

GPU compute devices are very efficient at parallelizing large numbers of work-items in a manner transparent to the application. Each GPU compute device uses the large number of wavefronts to hide memory access latencies by having the resource scheduler switch the active wavefront in a given compute unit whenever the current wavefront is waiting for a memory access to complete. Hiding memory access latencies requires that each work-item contain a large number of ALU operations per memory load/store.

Figure 2.6 shows the timing of a simplified execution of wavefronts in a single compute unit. At tim
97. __kernel void global_scan_kernel(__global int *out, unsigned int stage)
{
    unsigned int l, vlen;
    unsigned int prev_gr, prev_el;
    unsigned int curr_gr, curr_el;
    int add_elem;
    int lid  = get_local_id(0);
    int grid = get_group_id(0);
    int szgr = get_local_size(0);

    /* array size to be processed */
    vlen = 1 << stage;

    /* find the element to be added */
    l = grid >> stage;
    prev_gr = l * (vlen << 1) + vlen - 1;
    prev_el = prev_gr * szgr + szgr - 1;
    if (lid == 0)
        add_elem = out[prev_el];

    work_group_barrier(CLK_GLOBAL_MEM_FENCE | CLK_LOCAL_MEM_FENCE);
    add_elem = work_group_broadcast(add_elem, 0);

    /* find the array to which the element is to be added */
    curr_gr = prev_gr + 1 + grid % vlen;
    curr_el = curr_gr * szgr + lid;
    out[curr_el] += add_elem;
}

The first kernel fills the primes Boolean array, as in the OpenCL 1.2 example. The second kernel runs group_scan_kernel on this Boolean prime array. However, the group_scan_kernel function works only at the workgroup level. The next kernel, global_scan_kernel, runs in multiple (log n) stages, where n is the size of the input array. At each stage, it merges two consecutive work-groups for which the prefix sums are ready, for
98. …be created without having to use waiting mutexes/semaphores. The following kernel simultaneously inserts the IDs of various work-items into the list array by using the atomic CAS operation. The same loop also runs on the host and inserts the other half (N work-items). In this way, 2N numbers are inserted into this list.

__kernel void linkKernel(__global int *list)
{
    int head, i;
    i = get_global_id(0) + 1;
    head = list[0];
    if (i != get_global_size(0)) {
        do {
            list[i] = head;
        } while (!atomic_compare_exchange_strong(
                     (__global atomic_int *)&list[0], &head, i,
                     memory_order_release, memory_order_acquire,
                     memory_scope_system));
    }
}

Note how there is no wait to enter the critical section; list[0] and head are updated atomically. On the CPU, too, a similar loop runs. Again, note that the variables list and head must be in fine-grain SVM buffers. memory_order_release and memory_scope_system are used to ensure that the CPU gets the updates (hence the name "platform atomics").

This sample illustrates the use of the atomic fetch operation. The fetch operation is an RMW (Read-Modify-Write) operation. The following kernel computes the maximum of the N numbers in array A. The result of the intermediate comparisons is computed, and the result is placed in a Boolean array B. After

6.6 Pipes

6.6.1 Overview
99. …before the kernel executes, the output would be: 0 31 47 0 19 0 0 0 0 11.

The output array can be written into by using atomics directly, but doing so would essentially make the loop run sequentially. Extracting the non-zero numbers would require atomics or processing on the CPU through a sequential loop, both of which may be inefficient or non-scalable.

In OpenCL 2.0, the new workgroup built-ins can efficiently compute prefix sums and perform reductions with various operations (such as addition, maximum, and minimum) on a workgroup. Some such built-ins are work_group_scan_inclusive_<op>, work_group_scan_exclusive_<op>, and work_group_reduce_<op>, where <op> can be one of add, max, or min. The reduce operation performs the reduction in dimension by yielding the sum (or maximum, or minimum) of all elements in the array as a scalar, while the prefix sum yields an array of the same size, with each element containing the prefix sum for that index. Workgroup broadcast built-ins are also used.

The following code illustrates how the workgroup built-ins in OpenCL 2.0 can be used to efficiently extract the primes from an array of positive integers.

int isPrime(int number)
{
    int i;
    if (number == 0)
        return 0;
    for (i = 2; i*i <= number; i++) {
        if ((number % i) == 0)
            return 0;
100. …ble option to the compiler through the options string to clBuildProgram. For example, using the C++ API:

err = program.build(devices, "-g -O0");

To avoid source changes, set the environment variable as follows:

AMD_OCL_BUILD_OPTIONS_APPEND="-g -O0"
or
AMD_OCL_BUILD_OPTIONS="-g -O0"

Below is a sample debugging session of a program with a simple hello world kernel. The following GDB session shows how to debug this kernel. Ensure that the program is configured to be executed on the CPU. It is important to set CPU_MAX_COMPUTE_UNITS=1. This ensures that the program is executed deterministically.

Setting the Breakpoint in an OpenCL Kernel

To set a breakpoint, use:

b [N | function | kernel_name]

where N is the line number in the source code, function is the function name, and kernel_name is constructed as follows: if the name of the kernel is bitonicSort, the kernel_name is __OpenCL_bitonicSort_kernel.

Note that if no breakpoint is set, the program does not stop until execution is complete.

Also note that OpenCL kernel symbols are not visible in the debugger until the kernel is loaded. A simple way to check for known OpenCL symbols is to set a breakpoint in the host code at clEnqueueNDRangeKernel, and to use the GDB "info functions __OpenCL" command, as shown in the example below.
101. …ch tree sample presented later in this section, coarse-grain buffers are usually sufficient. In general, coarse-grain buffers provide faster access compared to fine-grain buffers, as the memory is not required to be consistent across devices.

for (i = 0; i < keys_per_wi; i++) {
    key = search_keys[init_id + i];
    tmp_node = root;
    while (1) {
        if (!tmp_node || (tmp_node->value == key))
            break;
        tmp_node = (key < tmp_node->value) ? tmp_node->left : tmp_node->right;
    }
    found_nodes[init_id + i] = tmp_node;
}

In the above example, the binary search tree is created using coarse-grain SVM on the host, as:

int flags = CL_MEM_READ_ONLY;
data_node = (node *)clSVMAlloc(context, flags, num_nodes * sizeof(node), 0);
if (NULL == data_node) {
    printf("Error allocating memory for nodes\n");
    exit(1);
}

The data is the tree created by the host as a coarse-grain buffer and is passed to the kernel as an input pointer.

[Table: performance of the OpenCL 2.0 (SVM) implementation vs. the OpenCL 1.2 implementation]

The above table shows the performance of the 2.0 implementation over the 1.2 implementation. As SVM was absent in OpenCL 1.2, the tree needed to be sent to the GPU memory. In addition, as pointers were not pointing to the same tree, a
102. …cification is extended for vector types by inducing a partial ordering as a function of the partial ordering of its elements. Following the existing rules for vector types in the OpenCL 1.2 specification, explicit conversion between vectors is not allowed. This reduces the number of possible overloaded functions with respect to vectors, but this is not expected to be a particular burden to developers, because explicit conversion can always be applied at the point of function evocation.

For overloaded kernels, the following syntax is used as part of the kernel name:

foo(type1,...,typen)

where type1,...,typen must be either an OpenCL scalar or vector type, or can be a user-defined type that is allocated in the same source file as the kernel foo.

To allow overloaded kernels, use the following syntax:

__attribute__((mangled_name(myMangledName)))

The kernel mangled_name is used as a parameter to pass to the clCreateKernel() API. This mechanism is needed to allow overloaded kernels without changing the existing OpenCL kernel creation API.

5.3.5 Templates

OpenCL C++ provides unrestricted support for C++ templates, as defined in Section 14 of the C++ language specification. The arguments to templates are extended to allow for all OpenCL base types, including vectors and pointers q
103. ction over all work NOMS ac sorte RAD haha hab NIA 2 forced ordering of events 5 format string 12 conversion guidelines 12 resolving compile time 12 front end ensuring the kernels meet OpenCL specifica TOM Luna certe seatanmenstedesenes 2 performs semantic checks 2 syntactic checks 2 standard C 2 translating 2 front end supports additional data types float8 2 lC 2 additional keywords global ararua pen Dena de ws 2 kernel 2 built in functions barrier 4 4424 4444444 2 get_global_id 2 function call quetylng ire a iaa ga ee beats 2 function names undecorated in Windows 7 FunctionName string address of extension 3 G g compiler option 4 experimental feature 6 g option passing to the compiler 2 gather scatter model Local Data Store LDS 2 gcc not supported 4 GCC option syntax 3 GDB GNU project debugger 4 1 sample session 2 GDB GNU project debugger 6 COMMANA 4 4 44 ace eR ER 3 documentation 4 running cygwin 4 running MINGW
104. …d devices.
• Create kernels within those programs.
• Allocate memory buffers or images, either on the host or on the device(s). (Memory can be copied between the host and device.)
• Write data to the device.
• Submit the kernel (with appropriate arguments) to the command queue for execution.
• Read data back to the host from the device.

The relationship between context(s), device(s), buffer(s), program(s), kernel(s), and command queue(s) is best seen by looking at sample code.

1.4 Synchronization

The two domains of synchronization in OpenCL are work-items in a single work-group and command-queue(s) in a single context. Work-group barriers enable synchronization of work-items in a work-group. Each work-item in a work-group must first execute the barrier before executing any instruction beyond this barrier. Either all of, or none of, the work-items in a work-group must encounter the barrier. A barrier or mem_fence operation does not have global scope, but is relevant only to the local workgroup on which it operates.

There are two types of synchronization between commands in a command-queue:

• command-queue barrier - enforces ordering within a single queue. Any resulting changes to memory are available to the following commands in the queue.
• ev
105. …d pipe is successful, and a negative value otherwise.

int write_pipe(pipe gentype p, reserve_id_t reserve_id, uint index, const gentype *ptr)
Write packet specified by ptr to the reserved area of the pipe referred to by reserve_id and index. The reserved pipe entries are referred to by indices that go from 0 ... num_packets-1. Returns 0 if write_pipe is successful, and a negative value otherwise.

Commit/flush functions:

void commit_read_pipe(pipe gentype p, reserve_id_t reserve_id)
void commit_write_pipe(pipe gentype p, reserve_id_t reserve_id)
void work_group_commit_read_pipe(pipe gentype p, reserve_id_t reserve_id)
void work_group_commit_write_pipe(pipe gentype p, reserve_id_t reserve_id)
void sub_group_commit_read_pipe(pipe gentype p, reserve_id_t reserve_id)
void sub_group_commit_write_pipe(pipe gentype p, reserve_id_t reserve_id)
Indicates that all reads and writes to num_packets associated with reservation reserve_id are completed.

Pipe query functions:

uint get_pipe_num_packets(pipe gentype p)
Returns the number of available entries in the pipe. The number of available entries in a pipe is a dynamic value. The value returned should be considered immediately stale.

uint get_pipe_max_packets(pipe gentype p)
Returns the maximum number of packets specified when the pipe was created.

The corresponding sample in the AMD APP SDK contains two kernels: a producer kernel
defined(__Redwood__)
    return "Redwood";
#elif defined(__Cedar__)
    return "Cedar";
#elif defined(__ATI_RV770__)
    return "RV770";
#elif defined(__ATI_RV730__)
    return "RV730";
#elif defined(__ATI_RV710__)
    return "RV710";
#elif defined(__Loveland__)
    return "Loveland";
#elif defined(__GPU__)
    return "GenericGPU";
#elif defined(__X86__)
    return "X86CPU";
#elif defined(__X86_64__)
    return "X86-64CPU";
#elif defined(__CPU__)
    return "GenericCPU";
#else
    return "UnknownDevice";
#endif
}

kernel void test_pf(global int *a)
{
    printf("Device Name: %s\n", getDeviceName());
}

A.8.13 cl_amd_bus_addressable_memory

This extension defines an API for peer-to-peer transfers between AMD GPUs and other PCIe devices, such as third-party SDI I/O devices. Peer-to-peer transfers have extremely low latencies by not having to use the host's main memory or the CPU (see Figure A.1). This extension allows memory allocated by the graphics driver to be shared with, and used by, other devices on the PCIe bus (peer-to-peer transfers) by exposing a write-only bus address. It also allows memory allocated on other PCIe devices (non-AMD GPUs) to be directly accessed by AMD GPUs. One possible use is for a video capture device to write directly into the GPU memory using its DMA. This extension is supported only on AMD Fir
e

However, such pointers must be associated with a named address space before they can be used. Functions may be written with arguments and return values that point to the generic address space, improving readability and programmability.

6.3.2.1 Generic example

In OpenCL 1.2, the developer needed to write three functions for a pointer p that can reference the local, private, or global address space:

void fooL(local int *p);
void fooP(private int *p);
void fooG(global int *p);

In OpenCL 2.0, the developer needs to write only one function:

void foo(int *p);

As foo is a generic function, the compiler will accept calls to it with pointers to any address space, except the constant address space.

Note: The OpenCL 2.0 spec itself shows most built-in functions that accept pointer arguments as accepting generic pointer arguments.

6.3.2.2 AMD APP SDK example

In the AMD APP SDK sample, addMul2D is a generic function that uses generic address spaces for its operands. The function computes the convolution sum of two vectors. Two kernels compute the convolution: one uses data in the global address space (convolution2DUsingGlobal); the other uses the local address space (sepiaToning2DUsingLocal). The use of a single function improves the readability of the source.

float4 addMul2D(uchar4 *src, float *filter, int2 filterDim, int width)
{
    int i, j;
    float4 sum
e 0, the wavefronts are queued and waiting for execution. In this example, only four wavefronts (T0 ... T3) are scheduled for the compute unit. The hardware limit for the number of active wavefronts is dependent on the resource usage (such as the number of active registers used) of the program being executed. An optimally programmed GPU compute device typically has many active wavefronts.

[Figure omitted: a timeline (cycles 0 through 80) showing four wavefronts (W0 ... W3) alternating between "executing", "ready (not executing)", and "stalled" states.]

Figure 2.6 Simplified Execution Of Wavefront On A Single Compute Unit

At runtime, wavefront T0 executes until cycle 20; at this time, a stall occurs due to a memory fetch request. The scheduler then begins execution of the next wavefront, T1. Wavefront T1 executes until it stalls or completes. New wavefronts execute, and the process continues until the available number of active wavefronts is reached. The scheduler then returns to the first wavefront, T0. If the data wavefront T0 is waiting for has returned from memory, T0 continues execution. In the example in Figure 2.6, the data is ready, so T0 continues. Because there were enough wavefronts and processing-element operations to cover the long memory latencies, the compute unit does not idle.
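The latency-hiding idea in the example can be stated as simple arithmetic: if a wavefront computes for c cycles and then waits m cycles on memory, the scheduler needs roughly ceil(m/c) other wavefronts to fill the wait, plus the stalled one itself. The helper below is a toy model under that assumption, not a hardware formula; with the figure's illustrative numbers (20 cycles of work before a stall) it yields 4 wavefronts, matching T0 through T3.

```cpp
// Toy latency-hiding model: how many wavefronts keep a compute unit busy
// if each computes `compute_cycles`, then stalls `latency_cycles` on memory.
// Illustrative only; real occupancy depends on registers, LDS, and more.
static int wavefronts_to_hide_latency(int latency_cycles, int compute_cycles) {
    // ceil(latency / compute) wavefronts run while one waits, plus itself.
    return (latency_cycles + compute_cycles - 1) / compute_cycles + 1;
}
```

With a 60-cycle stall and 20-cycle compute slices this gives 4; a much longer 400-cycle latency would need 21 wavefronts, which is why high occupancy matters for memory-bound kernels.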
e host uses the C++11 compiler and the same memory model.

kernel void ldstore(volatile global int *buffer, global int *atomicBuffer)
{
    int i;
    while (atomic_load_explicit((global atomic_int *)&atomicBuffer[0],
                                memory_order_acquire) != 99)
        ;
    i = get_global_id(0);
    buffer[i] = i;
    atomic_store_explicit((global atomic_int *)&atomicBuffer[i], (100+i),
                          memory_order_release);
}

The kernel next stores (100+i), where i is the ID of the work-item, into atomicBuffer[i]. The order used is memory_order_release, which ensures that the updated copy reaches the CPU, which is waiting for it to report PASS for the test. After the atomic operation, the updates on fine-grain variables, such as buffer, will also be available at the host. The CPU checks for the following to ensure that the results are OK:

for (i = 0; i < N; i++)
    while (std::atomic_load_explicit((std::atomic<int> *)&atomicBuffer[i],
                                     std::memory_order_acquire) != (100+i))
        ;

/* check the results now */
for (i = 0; i < N; i++)
    if (buffer[i] != i)
        printf("Test Failed\n");
printf("Test Passed\n");

6.5.2.2 Atomic Compare and Exchange (CAS)
6.5.2.3 Atomic Fetch

This sample illustrates the use of the atomic CAS operation, typically used for lock-free programming, in which a critical section can
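Since the text notes that the host side uses the C++11 memory model, the CAS pattern it refers to can be shown with host C++11 atomics. The sketch below is an illustrative analogue, not the SDK sample: a flag guarded by compare_exchange, where only the caller whose CAS succeeds may enter the critical section. The kernel-side counterpart would use OpenCL's atomic_compare_exchange_strong_explicit on an atomic_int.

```cpp
#include <atomic>

// Host-side C++11 sketch of the compare-and-swap (CAS) idiom described
// above. Illustrative names; not part of the OpenCL API or the SDK sample.
struct toy_lock {
    std::atomic<int> flag{0};  // 0 = free, 1 = held

    // Enter the critical section only if we atomically swap 0 -> 1.
    bool try_acquire() {
        int expected = 0;
        return flag.compare_exchange_strong(expected, 1,
                                            std::memory_order_acquire);
    }

    void release() { flag.store(0, std::memory_order_release); }
};
```

The acquire/release orders mirror the load/store orders used in the ldstore sample: work done inside the section is published to the next successful acquirer.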
e is to set a conditional breakpoint when get_global_id(N) == ID. To do this, use:

    b <function | kernel_name> if (get_global_id(N) == ID)

where N can be 0, 1, or 2.

2. For complete GDB documentation, see http://www.gnu.org/software/gdb/documentation/

3. For debugging OpenCL kernels in Windows, a developer can use GDB running in cygwin or minGW. It is done in the same way as described in sections 3.1 and 3.2.

Notes:
• Only OpenCL kernels are visible to GDB when running cygwin or minGW.
• GDB under cygwin/minGW currently does not support host code debugging.
• It is not possible to use two debuggers attached to the same process. Do not try to attach Visual Studio to a process and concurrently attach GDB to the kernels of that process.
• Continue to develop the application code using Visual Studio. Currently, gcc running in cygwin or minGW is not supported.

Chapter 4 Debugging OpenCL

Chapter 5 OpenCL Static C++ Programming Language

5.1 Overview

This extension defines the OpenCL Static C++ kernel language, which is a form of the ISO/IEC Programming languages C++ specification. This language supports overloading and templates that can be resolved at compile time (hence, static), while restricting the use of language features that require dynamic/runtime resolving. The language also is extended to support most of the
ePro professional graphics cards.

[Figure omitted: Figure A.1 Peer-to-Peer Transfers Using the cl_amd_bus_addressable_memory Extension; a graphics card and an SDI input/output card (SDI in, SDI out) exchange data directly over the PCIe bus, bypassing CPU memory.]

A.9 Supported Functions for cl_amd_fp64 / cl_khr_fp64

AMD OpenCL is now cl_khr_fp64-compliant on devices compliant with OpenCL 1.1 and greater. Thus, cl_amd_fp64 is now a synonym for cl_khr_fp64 on all supported devices.

A.10 Extension Support by Device

Table A.1 and Table A.2 list the extension support for selected devices.

Table A.1 Extension Support for AMD GPU Devices 1

Columns, left to right: AMD APUs (Brazos, Llano, Trinity) and AMD Radeon HD GPUs (Tahiti/Pitcairn/Cape Verde, Turks, Cayman, Barts, Cypress).

cl_khr_*_atomics (32-bit):     Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes
cl_ext_atomic_counters_32:     Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes
cl_khr_gl_sharing:             Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes
cl_khr_byte_addressable_store: Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes
cl_ext_device_fission:         CPU only, CPU only, CPU only, No, No, No, No, No
cl_amd_device_attribute_query: Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes
cl_khr_fp64:                   CPU only, CPU only, CPU only, Yes, Yes, Yes, No, Yes
cl_amd_fp64:                   CPU only, CPU only, CPU only, Yes, Yes, Yes, No, Yes
eadBuffer(bufY, CL_TRUE, 0, length * sizeof(cl_float), pY);
printVector("Y", pY, length);

/////////////////////////////////////////////////////////////////
// Release host resources
/////////////////////////////////////////////////////////////////
cleanupHost();
}
catch (cl::Error err)
{
    /////////////////////////////////////////////////////////////
    // Catch OpenCL errors and print log if it is a build error
    /////////////////////////////////////////////////////////////
    cerr << "ERROR: " << err.what() << "(" << err.err() << ")" << endl;
    if (err.err() == CL_BUILD_PROGRAM_FAILURE)
    {
        string str = program.getBuildInfo<CL_PROGRAM_BUILD_LOG>(devices[0]);
        cout << "Program Info: " << str << endl;
    }
    cleanupHost();
}
catch (string msg)
{
    cerr << "Exception caught in main(): " << msg << endl;
    cleanupHost();
}

Example: Parallel Min() Function

This medium-complexity sample shows how to implement an efficient parallel min() function. The code is written so that it performs very well on either CPU or GPU. The number of threads launched depends on how many hardware processors are available. Each thread walks the source buffer, using a device-optimal access pattern selected at runtime. A multi-stage reduction using local and global atomics produces the single result value.

The sample includes a number of programming techniques useful for simple tests. Only minimal error checking and resource tear-down is used.

Runtime Code:

1. The source memory buffer is allocated
eference Guide for more specifics about optimization techniques.

1.5.1 Data Share Operations

Local data share (LDS) is a very low-latency RAM scratchpad for temporary data, with at least one order of magnitude higher effective bandwidth than direct, uncached global memory. It permits sharing of data between work-items in a work-group, as well as holding parameters for pixel shader parameter interpolation. Unlike read-only caches, the LDS permits high-speed write-to-read re-use of the memory space (full gather/read/load and scatter/write/store operations).

Figure 1.4 shows the conceptual framework of the LDS's integration into the memory of AMD GPUs using OpenCL.

[Figure omitted: Figure 1.4 High-Level Memory Configuration; a compute device containing work-groups, with private memory per work-item and local memory per work-group, connected to global/constant memory and host memory.]

Physically located on-chip, directly next to the ALUs, the LDS is approximately one order of magnitude faster than global memory (assuming no bank conflicts). There are 32 kB of memory per compute unit, segmented into 32 or 16 banks (depending on the GPU type) of 1 k dwords (for 32 banks) or 2 k dwords (for 16 banks). Each bank is a 256x32 two-port RAM (1R/1W per clock cycle). Dwords are placed in the banks serially, but all banks can execute a store or load simultaneously.
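Because dwords are placed in the banks serially, which bank an LDS address falls in can be computed directly, and two accesses in the same cycle conflict when they hit the same bank at different addresses. The helpers below are a toy model of that description, with names invented for illustration; they are not a hardware specification (real conflict behavior also depends on broadcast and access-size details).

```cpp
// Toy model of LDS banking as described above: successive dwords map to
// successive banks. num_banks is 32 or 16, depending on the GPU type.
static int lds_bank(unsigned byte_addr, int num_banks) {
    return (byte_addr / 4) % num_banks;  // dword index modulo bank count
}

// Two same-cycle accesses conflict if they hit the same bank at different
// addresses (a same-address access can be serviced as a broadcast).
static bool lds_conflict(unsigned a, unsigned b, int num_banks) {
    return a != b && lds_bank(a, num_banks) == lds_bank(b, num_banks);
}
```

For a 32-bank LDS, addresses 0 and 128 land in bank 0 and would conflict, while consecutive dword addresses (0, 4, 8, ...) spread across distinct banks, which is why stride-1 dword access patterns are conflict-free.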
el objects, if there are more than one.

cl::Program::Sources sources(1, std::make_pair(kernelStr.c_str(), kernelStr.length()));
program = cl::Program(context, sources);
program.build(devices);
kernel = cl::Kernel(program, "saxpy");

9. Enqueue the kernel for execution on the device (GPU in our example).

Set each argument individually in separate kernel.setArg() calls. The arguments do not need to be set again for subsequent kernel enqueue calls. Reset only those arguments that are to pass a new value to the kernel. Then, enqueue the kernel to the command queue with the appropriate global and local work sizes.

The bindings code: if there is an OpenCL call error, it prints out the name of the call and the error code (codes are defined in CL/cl.h). If there is a kernel compilation erro

kernel.setArg(0, bufX);
kernel.setArg(1, bufY);
kernel.setArg(2, a);
queue.enqueueNDRangeKernel(kernel, cl::NDRange(), cl::NDRange(length), cl::NDRange(64));

10. Read back the results from bufY to the host pointer pY. We will make this a blocking call (using the CL_TRUE argument), since we do not want to proceed before the kernel has finished execution and we have our results back.

queue.enqueueReadBuffer(bufY, CL_TRUE, 0, length * sizeof(cl_float), pY);

11. Clean up the host
elated additions provide effective ways of expressing heterogeneous programming constructs. The following sections highlight the salient features of OpenCL 2.0 and provide usage guidelines:

• Shared Virtual Memory (SVM)
• Generic Address Space
• Device-side enqueue
• Atomics and synchronization
• Pipes
• Sub-groups
• Program-scope global variables
• Image enhancements
• Non-uniform work-group size

Sample code is included wherever appropriate; complete samples illustrating the OpenCL 2.0 features are provided with the AMD APP SDK. For guidelines on how to migrate from OpenCL 1.2 to OpenCL 2.0, and for information about querying for image- and device-specific extensions, see Portability considerations. For a list of the new and deprecated functions, see Appendix F, New and deprecated functions in OpenCL 2.0.

6.2 Shared Virtual Memory (SVM)

6.2.1 Overview

In OpenCL 1.2, the host and OpenCL devices do not share the same virtual address space. Consequently, the host memory, the device memory, and communication between the host and the OpenCL devices need to be explicitly specified and managed. Buffers may need to be copied over to the OpenCL device memory for processing and copied back after processing; the same memory cannot be accessed simult
ementations and between different versions of OpenCL.

Appendix A OpenCL Optional Extensions

The OpenCL extensions are associated with the devices and can be queried for a specific device. Extensions can be queried for platforms also, but that means that all devices in the platform support those extensions. Table A.1 on page A-15 lists the supported extensions.

A.1 Extension Name Convention

The name of an extension is standardized and must contain the following elements (without spaces in the name, in lower case):
• cl_khr_<extension_name> for extensions approved by the Khronos Group. For example: cl_khr_fp64.
• cl_ext_<extension_name> for extensions provided collectively by multiple vendors. For example: cl_ext_device_fission.
• cl_<vendor_name>_<extension_name> for extensions provided by a specific vendor. For example: cl_amd_media_ops.

The OpenCL Specification states that all API functions of the extension must have names in the form of cl<FunctionName>KHR, cl<FunctionName>EXT, or cl<FunctionName><VendorName>. All enumerated values must be in the form of CL_<enum_name>_KHR, CL_<enum_name>_EXT, or CL_<enum_name>_<VendorName>.
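The naming convention above can be checked mechanically. The helper below is an illustrative validator, not part of any OpenCL API: it only verifies the lower-case "cl_" prefix followed by a non-empty middle field (khr, ext, or a vendor name) and a non-empty extension name.

```cpp
#include <cstring>

// Illustrative checker for the extension naming convention described above:
// the name must look like "cl_<khr|ext|vendor>_<extension_name>".
// Not an OpenCL API function.
static bool is_valid_extension_name(const char* name) {
    if (std::strncmp(name, "cl_", 3) != 0) return false;
    const char* sep = std::strchr(name + 3, '_');  // end of khr/ext/vendor field
    return sep != nullptr        // a second underscore must exist,
        && sep != name + 3       // the middle field must be non-empty,
        && sep[1] != '\0';       // and the extension name must be non-empty.
}
```

All three examples from the text (cl_khr_fp64, cl_ext_device_fission, cl_amd_media_ops) pass this shape check, while names missing the prefix or a field do not.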
enCL 1.2 permits the declaration of only constant address space variables at program scope. OpenCL 2.0 permits the declaration of variables in the global address space at program (i.e. outside function) scope. These variables have the lifetime of the program in which they appear, and may be initialized. The host cannot directly access program-scope variables; a kernel must be used to read/write their contents from/to a buffer created on the host.

Program-scope global variables can save data across kernel executions. Using program-scope variables can potentially eliminate the need to create buffers on the host and pass them into each kernel for processing. However, there is a limit to the size of such variables. The developer must ensure that the total size does not exceed the value returned by the device info query CL_DEVICE_MAX_GLOBAL_VARIABLE_SIZE.

6.9 Image Enhancements

6.9.1 Overview

OpenCL 2.0 introduces significant enhancements for processing images.

6.9.2 sRGB

A read_write access qualifier for images has been added. The qualifier allows reading from and writing to certain types of images (verified against clGetSupportedImageFormats by using the CL_MEM_KERNEL_READ_AND_WRITE flag) in the same kernel, but reads must be sampler-less. An atomic_work_item_fence with
ents enforce ordering between or within queues. Enqueued commands in OpenCL return an event identifying the command, as well as the memory object updated by it. This ensures that following commands waiting on that event see the updated memory objects before they execute.

1.5 Memory Architecture and Access

OpenCL has four memory domains: private, local, global, and constant; the AMD Accelerated Parallel Processing system also recognizes host (CPU) and PCI Express (PCIe) memory.

Memory Type  Description
private      Specific to a work-item; it is not visible to other work-items.
local        Specific to a work-group; accessible only by work-items belonging to that work-group.
global       Accessible to all work-items executing in a context, as well as to the host (read, write, and map commands).
constant     Read-only region for host-allocated and -initialized objects that are not changed during kernel execution.
host (CPU)   Host-accessible region for an application's data structures and program data.
PCIe         Part of host (CPU) memory accessible from, and modifiable by, the host program and the GPU compute device. Modifying this memory requires synchronization between the GPU compute device and the CPU.

Figure 1.2 illustrates the interrelationship of the memories.
eported, and can be used to create OpenCL binary images.

cl_amd_event_callback

This extension provides the ability to register event callbacks for states other than cl_complete. The full set of event states is allowed: cl_queued, cl_submitted, and cl_running. This extension is enabled automatically, and does not need to be explicitly enabled through #pragma when using the SDK v2 of AMD Accelerated Parallel Processing.

A.8.8 cl_amd_popcnt

This extension introduces a population count function called popcnt. This extension was taken into core OpenCL 1.2, and the function was renamed popcount. The core 1.2 popcount function (documented in section 6.12.3 of the OpenCL Specification) is identical to the AMD extension popcnt function.

A.8.9 cl_amd_media_ops

This extension adds the following built-in functions to the OpenCL language. Note: For OpenCL scalar types, n = 1; for vector types, it is 2, 4, 8, or 16. Note: in the following, n denotes the size, which can be 1, 2, 4, 8, or 16; [i] denotes the indexed element of a vector, designated 0 to n-1.

Built-in function: amd_pack

uint amd_pack(float4 src)

Return value:
((uint)src[0] & 0xFF) << 0  |
((uint)src[1] & 0xFF) << 8  |
((uint)src[2] & 0xFF) << 16 |
((uint)src[3] & 0xFF) << 24
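The amd_pack formula above can be transcribed directly into host C++ to see what it computes: each float lane is converted to an unsigned integer, masked to its low byte, and packed into one 32-bit word with lane 0 in the least significant byte. This is only a transcription of the documented arithmetic for illustration, not the GPU built-in itself.

```cpp
#include <cstdint>

// Host-side transcription of the documented amd_pack arithmetic.
// Illustrative only; the real amd_pack is an OpenCL built-in on the device.
static uint32_t amd_pack_emulated(const float src[4]) {
    uint32_t result = 0;
    for (int i = 0; i < 4; ++i)
        result |= (static_cast<uint32_t>(src[i]) & 0xFFu) << (8 * i);
    return result;
}
```

For example, packing the lanes (1.0, 2.0, 3.0, 4.0) yields the word 0x04030201, and values outside 0..255 are truncated to their low byte by the mask.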
er of work-group
• Sum reduction across sub-group
• Max reduction across sub-group
• Min reduction across sub-group
• Sum exclusive scan across sub-group
• Max exclusive scan across sub-group
• Min exclusive scan across sub-group
• Sum inclusive scan across sub-group
• Max inclusive scan across sub-group
• Min inclusive scan across sub-group
• sub-group read reservation
• sub-group write reservation
• sub-group commit read reservation
• sub-group commit write reservation
• number of sub-groups in uniform part of NDRange
• maximum sub-group size for a block

Appendix F New and deprecated functions in OpenCL 2.0

F.2 Deprecated built-ins

• write_mem_fence
• atomic_add
• atomic_sub
• atomic_xchg
• atomic_inc
• atomic_dec
• atomic_cmpxchg
• atomic_min
• atomic_max
• atomic_and
• atomic_or
• atomic_xor

F.3 New runtime APIs in OpenCL 2.0

F.3.1 New Types

cl_device_svm_capabilities: returned by clGetDeviceInfo(CL_DEVICE_SVM_CAPABILITIES).
cl_queue_properties: see clCreateCommandQueueWithProperties.
cl_svm_mem_flags: see clSVMAlloc.
cl_pipe_properties: see clCreatePipe.
cl_pipe_info: see clGetPipeInfo.
cl_sampler_properties: see clCreateSamplerWithProperties.
cl_kernel_exec_info: see clSetKernelExecInfo.
cl_image_desc: a field name changed from buffer to mem_objec
eresting sub-region, the kernel can instead launch a new sub-kernel to process each marked sub-region. This significantly simplifies the code and improves efficiency, due to the elimination of the interactions with, and dependence on, the host.

6.4.2.3 Extracting primes from an array by using device-side enqueue

Given an array of positive integers, the device-side enqueue feature in OpenCL 2.0 can be used to efficiently extract the prime numbers present in that array and store them into another output array. The following figure explains the input and output expected from this sample:

Input:  12 31 47 64 19 27 49 81 99 11
Output: 31 47 19 11

As can be seen in the figure above, the input is given as an array of positive numbers, and the expected output is an array of the primes in the input array. This is a classic example of data-parallel processing, in which the processing for each array element (checking whether the element is prime or not) can be done in parallel with all others. In OpenCL 1.2, this processing can be done as follows:

kernel void extractPrimes(global int *in, global int *out)
{
    int id = get_global_id(0);
    if (isPrime(in[id]))
        out[id] = in[id];
}

Assuming that the output array is initialized with zeroes
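The kernel above relies on an isPrime helper that the listing does not show. The sketch below is a plain C++ model of one launch of that kernel over ids 0..n-1, with trial division as one plausible (assumed, not the SDK's) isPrime implementation; as in the text, the output array is assumed zero-initialized.

```cpp
// Plain C++ model of the extraction kernel above. The guide's listing
// omits isPrime; trial division here is an assumed implementation.
static bool isPrime(int n) {
    if (n < 2) return false;
    for (int d = 2; d * d <= n; ++d)
        if (n % d == 0) return false;
    return true;
}

// Serial equivalent of running the kernel once per work-item id.
// out[] is assumed zero-initialized, matching the note in the text.
static void extractPrimes(const int* in, int* out, int n) {
    for (int id = 0; id < n; ++id)
        if (isPrime(in[id]))
            out[id] = in[id];
}
```

Running this over the sample input {12, 31, 47, 64, 19, 27, 49, 81, 99, 11} leaves the primes 31, 47, 19, and 11 at their original positions and zeroes elsewhere; compacting those non-zero slots into a dense output array is the part the device-enqueued child kernels handle in the OpenCL 2.0 version.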
ernel is eliminated, device enqueue can improve the performance of applications. Some platforms (such as AMD's) provide a standard way of enqueuing work to the hardware, which can further improve the performance. Device-side enqueue has been observed to reduce the overhead of enqueuing by more than 3x in some cases.

Device-side enqueue can especially benefit applications that are inherently recursive, or that comprise work to be performed dynamically.

2. In OpenCL 1.2, knowing when all the work-groups of the running kernel have completed execution requires waiting on a completion event from that kernel. This necessitates waiting on the host before another kernel can be launched, if the host requires the result of a computation before it can proceed. Device-side enqueue allows the parent kernel to launch the child kernel, and thus eliminates this wait. The extracting-primes example in the subsequent sections illustrates this benefit.

6.4.2
est performance is attained when the group size is an integer multiple of the wavefront size.

local data share (LDS): The LDS is a high-speed, low-latency memory, private to each compute unit. It is a full gather/scatter model: a work-group can write anywhere in its allocated space. This model is unchanged for the AMD Radeon HD 7XXX series.

The constraints of the current LDS model are:
• The LDS size is allocated per work-group. Each work-group specifies how much of the LDS it requires. The hardware scheduler uses this information to determine which work-groups can share a compute unit.
• Data can only be shared within work-items in a work-group.
• Memory accesses outside of the work-group result in undefined behavior.

1.2 OpenCL Overview

The OpenCL programming model consists of producing complicated task graphs from data-parallel execution nodes. In a given data-parallel execution, commonly known as a kernel launch, a computation is defined in terms of a sequence of instructions that executes at each point in an N-dimensional index space. It is a common (though not required) formulation of an algorithm that each computation index maps to an element in an input data set.

The OpenCL data-parallel programming model is hierarchical. The hierarchical subdivision can be specified in two ways:
• Explicitly: the developer defines the total number of work-items to execute in parallel, as well as the division of work-items into specif
etween device and host.

2.3.2 DMA Transfers

Certain memory transfer calls use the DMA engine. To properly leverage the DMA engine, make the associated OpenCL data transfer calls. See the AMD OpenCL Optimization Reference Guide for more information.

Direct Memory Access (DMA) memory transfers can be executed separately from the command queue, using the DMA engine on the GPU compute device. DMA calls are executed immediately, and the ordering of DMA calls and command queue flushes is guaranteed.

DMA transfers can occur asynchronously. This means that a DMA transfer is executed concurrently with other system or GPU compute operations when there are no dependencies. However, data is not guaranteed to be ready until the DMA engine signals that the event or transfer is completed. The application can use OpenCL to query the hardware for DMA event completion. If used carefully, DMA transfers are another source of parallelization.

Southern Islands devices have two DMA engines that can perform bidirectional transfers over the PCIe bus, with multiple queues created in consecutive order, since each DMA engine is assigned to an odd or an even queue, correspondingly.

2.3.3 Masking Visible Devices

By default, OpenCL applications are exposed to all GPUs installed in the system; this allows applic
etween the system and the GPU compute device occur on the PCIe channel. AMD Accelerated Parallel Processing graphics cards use PCIe 2.0 x16 (second generation, 16 lanes). Generation 1 x16 has a theoretical maximum throughput of 4 GBps in each direction. Generation 2 x16 doubles the throughput to 8 GBps in each direction. Southern Islands AMD GPUs support PCIe 3.0, with a theoretical peak performance of 16 GBps. Actual transfer performance is CPU and chipset dependent.

Transfers from the system to the GPU compute device are done either by the command processor or by the DMA engine. The GPU compute device also can read and write system memory directly from the compute unit through kernel instructions over the PCIe bus.

2.3.1 Processing API Calls: The Command Processor

The host application does not interact with the GPU compute device directly. A driver layer translates and issues commands to the hardware on behalf of the application.

Most commands to the GPU compute device are buffered in a command queue on the host side. The queue of commands is sent to, and processed by, the GPU compute device. There is no guarantee as to when commands from the command queue are executed, only that they are executed in order.

Command queue elements include:
• Kernel execution calls
• Kernels
• Constants
• Transfers b
eue kernel Check
5. Enqueue blocking map of Check result (e.g., with clEnqueueSVMMap)
6. If Check result is not "Converged", then:
   Enqueue unmap of Check result
7. Go to Step 1

However, with device-side enqueue in OpenCL 2.0, the Check kernel may be altered to enqueue blocks that carry out A, B, C, and Check when it detects that convergence has not been reached. This avoids a potentially costly interaction with the host on each iteration. Also, a slight modification of Check might allow the replacement of the entire loop above with a single host-side enqueue of the Check kernel.

6.4.2.2 Data-dependent refinement

Consider a search or computational process that works from coarse levels to increasingly finer levels, operating something like this:

1. Search/compute over current region
2. Loop over sub-regions in current region
3. If a sub-region is interesting:
   1. Refine the sub-region
   2. Apply a process to the refined sub-region

With OpenCL 1.2, this process would require a complex interaction between the host and the OpenCL device. The device-side kernel would need to somehow mark the sub-regions requiring further work, and the host-side code would need to scan all of the sub-regions, looking for the marked ones, and then enqueue a kernel for each marked sub-region. This is made more difficult by the lack of globally visible atomic operations in OpenCL 1.2. However, with OpenCL 2.0, rather than just marking each int
exe: Does not generate the executable ISA in the .text section. For more information, see Appendix C, OpenCL Binary Image Format (BIF) v2.0.

-save-temps[=<prefix>]: This option dumps intermediate temporary files, such as IL and ISA code, for each OpenCL kernel. If <prefix> is not given, temporary files are saved in the default temporary directory (the current directory for Linux; C:\Users\<user>\AppData\Local for Windows). If <prefix> is given, those temporary files are saved with the given prefix. If <prefix> is an absolute path prefix, such as C:\your\work\dir\mydumpprefix, those temporaries are saved under C:\your\work\dir, with mydumpprefix as the prefix to all temporary names. For example:

-save-temps
    under the default directory: _temp_nn_xxx_yyy.il, _temp_nn_xxx_yyy.isa
-save-temps=aaa
    under the default directory: aaa_nn_xxx_yyy.il, aaa_nn_xxx_yyy.isa
-save-temps=C:\you\dir\bbb
    under C:\you\dir: bbb_nn_xxx_yyy.il, bbb_nn_xxx_yyy.isa

where xxx and yyy are the device name and kernel name for this build, respectively, and nn is an internal number to identify a build, to avoid overriding temporary files. Note that this naming convention is subject to change.

To avoid source changes, there are two environment variables that can be used to change CL o
features described in Section 6 of the OpenCL spec: new data types (vectors, images, samplers, etc.), OpenCL built-in functions, and more.

5.1.1 Supported Features

The following list contains the major C++ features supported by this extension:
• Kernel and function overloading
• Inheritance
  - Strict inheritance
  - Friend classes
  - Multiple inheritance
• Templates
  - Kernel templates
  - Member templates
  - Template default argument
  - Limited class templates (the virtual keyword is not exposed)
  - Partial template specialization
• Namespaces
• References
• this operator

Note that supporting templates and overloading highly improves the efficiency of writing code: it allows developers to avoid replication of code when not necessary.

1. Programming languages C++, International Standard ISO/IEC 14882:1998.

Using kernel templates and kernel overloading requires support from the runtime API as well. AMD provides a simple extension to clCreateKernel, which enables the user to specify the desired kernel.

5.1.2 Unsupported Features

C++ features not supported by this extension are:
• Virtual functions (methods marked with the virtual keyword)
• Abstract classes (a class defined only of pure virtual functions)
• Dynamic memory allocation (non-placement
format string is a character sequence that is null-terminated and composed of zero or more directives: ordinary characters (i.e., not %), which are copied directly to the output stream unchanged; and conversion specifications, each of which can result in fetching zero or more arguments, converting them, and then writing the final result to the output stream. The format string must be resolvable at compile time; thus, it cannot be dynamically created by the executing program. Note that the use of variadic arguments in the built-in printf does not imply its use in other built-ins; more importantly, it is not valid to use printf in user-defined functions or kernels.

The OpenCL C printf closely matches the definition found as part of the C99 standard. Note that conversions introduced in the format string with % are supported, with the following guidelines:

• A 32-bit floating point argument is not converted to a 64-bit double, unless the extension cl_khr_fp64 is supported and enabled, as defined in section 9.3 of the OpenCL Specification 1.1. This includes the double variants, if cl_khr_fp64 is supported and defined in the corresponding compilation unit.
• 64-bit integer types can be printed using %ld, %lx, and %lu. %lld, %llx, and %llu are not supported, and are reserved for 128-bit integer types (long long).
• All OpenCL vector types (section 6.1.2 of the OpenCL Specification 1.1) can be explicitly passed and printed usi
                  found" << std::endl;
            break;
        }

        status = clGetGLContextInfoKHR(cpsGL,
                     CL_CURRENT_DEVICE_FOR_GL_CONTEXT_KHR,
                     sizeof(cl_device_id),
                     &interopDeviceId,
                     NULL);

        // Create OpenCL context from device's id
        context = clCreateContext(cpsGL,
                      1,
                      &interopDeviceId,
                      0,
                      0,
                      &status);

E.3 Additional GL Formats Supported

The following is a list of GL formats, beyond the minimum set listed in The OpenCL Extension Specification, v 1.2, that AMD supports.

Table E.1 AMD-Supported GL Formats

GL internal format   CL image format (channel order, channel data type)
GL_ALPHA8            CL_A,   CL_UNORM_INT8
GL_R8                CL_R,   CL_UNORM_INT8
GL_R8UI              CL_R,   CL_UNSIGNED_INT8
GL_R8I               CL_R,   CL_SIGNED_INT8
GL_RG8               CL_RG,  CL_UNORM_INT8
GL_RG8UI             CL_RG,  CL_UNSIGNED_INT8
GL_RG8I              CL_RG,  CL_SIGNED_INT8
GL_RGB8              CL_RGB, CL_UNORM_INT8
GL_RGB8UI            CL_RGB, CL_UNSIGNED_INT8
GL_RGB8I             CL_RGB, CL_SIGNED_INT8
GL_R16               CL_R,   CL_UNORM_INT16
GL_R16UI             CL_R,   CL_UNSIGNED_INT16
GL_R16I              CL_R,   CL_SIGNED_INT16
GL_RG16              CL_RG,  CL_UNORM_INT16
GL_RG16UI            CL_RG,  CL_UNSIGNED_INT16
GL_RG16I             CL_RG,  CL_SIGNED_INT16
GL_RGB16             CL_RGB, CL_UNORM_INT16
GL_RGB16UI           CL_RGB, CL_UNSIGNED_INT16
GL_RGB16I            CL_RGB, CL_SIGNED_INT16
GL_R32I              CL_R,   CL_SIGNED_INT32
GL_R32UI             CL_R,   CL_UNSIGNED_INT32
GL_R32F              CL_R,   CL_FLOAT
GL_RG32I             CL_RG,  CL_SIGNED_INT32
GL_RG32UI            CL_RG,  CL_UNSIGNED_INT32
GL_RG32F             CL_RG,  CL_FLOAT
GL_RGB32I            CL_RGB, CL_SIGNED_INT32
GL_RGB32UI           CL_RGB, CL_UNSIGNED_INT32
GL_RGB32F            CL_RGB, CL_FLOAT
Work-Item Creation

For each work-group, the GPU compute device spawns the required number of wavefronts on a single compute unit. If there are non-active work-items within a wavefront, the processing elements that would have been mapped to those work-items are idle. An example is a work-group that is a non-multiple of a wavefront size: for example, if the work-group size is 32, the wavefront is half empty and unused.

Flow Control

Flow control, such as branching, is achieved by combining all necessary paths as a wavefront. If work-items within a wavefront diverge, all paths are executed serially. For example, if a work-item contains a branch with two paths, the wavefront first executes one path, then the second path. The total time to execute the branch is the sum of each path time. An important point is that, even if only one work-item in a wavefront diverges, the rest of the work-items in the wavefront execute the branch. The number of work-items that must be executed during a branch is called the branch granularity. On AMD hardware, the branch granularity is the same as the number of work-items in a wavefront.

Masking of wavefronts is effected by constructs such as:

    if (x)
    {
        // items within these braces = A
    }
    else
    {
        // items within these braces = B
    }

The wavefront
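The branch-cost rule above (a diverged wavefront pays the sum of both path times) can be modeled in a few lines of host-side C. This is a sketch of the cost model only, not AMD API; the 64-lane width matches the wavefront size described in this chapter:

```c
#include <assert.h>

#define WAVEFRONT_SIZE 64  /* wavefront width assumed throughout this sketch */

/* Models the branch granularity rule: if any lane in the wavefront takes a
 * different path, the wavefront executes both paths serially, so the cost
 * is the sum of the path costs; if all lanes agree, only one path runs. */
unsigned branch_cycles(const int take_path_a[WAVEFRONT_SIZE],
                       unsigned cycles_a, unsigned cycles_b)
{
    int any_a = 0, any_b = 0;
    for (int i = 0; i < WAVEFRONT_SIZE; ++i) {
        if (take_path_a[i]) any_a = 1; else any_b = 1;
    }
    /* A diverged wavefront pays for both paths. */
    return (any_a ? cycles_a : 0) + (any_b ? cycles_b : 0);
}
```

A single divergent lane is enough to double the cost of an otherwise uniform branch, which is why data-dependent branches inside a wavefront are expensive.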
Work-Item Processing

All processing elements within a vector unit execute the same instruction in each cycle. A work-item can issue one instruction per clock cycle. The block of work-items that are executed together is called a wavefront. To hide latencies due to memory accesses and processing element operations, up to four work-items from the same wavefront are pipelined on the same vector unit. The size of wavefronts can differ on different GPU compute devices. For example, some of the low-end and older GPUs, such as the AMD Radeon HD 54XX series graphics cards, have a wavefront size of 32 work-items. Higher-end and newer AMD GPUs have a wavefront size of 64 work-items.

Compute units operate independently of each other, so it is possible for different compute units to execute different instructions.

Before discussing flow control, it is necessary to clarify the relationship of a wavefront to a work-group. If a user defines a work-group, it consists of one or more wavefronts. A wavefront is a hardware thread with its own program counter; it is capable of following control flow independently of other wavefronts. A wavefront consists of 64 or fewer work-items. The mapping is based on a linear work-item order. On a device with a wavefront size of 64, work-items 0-63 map to wavefront 0, work-items 64-127 map to wavefront 1, etc. For optimum hardware usage, an integer multiple of 64 work-items is recommended.
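The linear work-item-to-wavefront mapping described above can be sketched in host-side C. This is illustrative only (not AMD API); on real devices the width should be queried, e.g. via clGetKernelWorkGroupInfo with CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE:

```c
#include <assert.h>

/* Linear mapping: flat work-item id / wavefront size gives the wavefront
 * index, so ids 0-63 land in wavefront 0, 64-127 in wavefront 1, etc.
 * (64-wide wavefront assumed; query the device in real code). */
unsigned wavefront_of(unsigned flat_work_item_id, unsigned wavefront_size)
{
    return flat_work_item_id / wavefront_size;
}

/* Round up: a 96-item group on a 64-wide device still occupies 2 wavefronts,
 * with the second one half empty, as the text explains. */
unsigned wavefronts_per_group(unsigned work_group_size, unsigned wavefront_size)
{
    return (work_group_size + wavefront_size - 1) / wavefront_size;
}
```

The round-up in the second helper is exactly why a work-group size that is an integer multiple of 64 is recommended: any remainder leaves idle lanes in the last wavefront.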
...such as a given GPU, or an abstracted device, such as the collection of all CPU cores on the host.

3. On the selected device, an OpenCL context is created. A context ties together a device, memory buffers related to that device, OpenCL programs, and command queues. Note that buffers related to a device can reside on either the host or the device. Many OpenCL programs have only a single context, program, and command queue.

4. Before an OpenCL kernel can be launched, its program source is compiled, and a handle to the kernel is created.

5. A memory buffer is allocated in the context.

6. The kernel is launched. While it is necessary to specify the global work size, OpenCL determines a good local work size for this device. Since the kernel was launched asynchronously, clFinish() is used to wait for completion.

7. The data is mapped to the host for examination. Calling clEnqueueMapBuffer ensures the visibility of the buffer on the host, which in this case probably includes a physical transfer. Alternatively, we could use clEnqueueWriteBuffer, which requires a pre-allocated host-side buffer.

Example Code 1
... 2-10

Contents vii
Copyright 2013 Advanced Micro Devices, Inc. All rights reserved.

Chapter 3  Building and Running OpenCL Programs
    3.1 Compiling the Program ..... 3-2
        3.1.1 Compiling on Windows ..... 3-2
        3.1.2 Compiling on Linux ..... 3-3
        3.1.3 Supported Standard OpenCL Compiler Options ..... 3-4
        3.1.4 AMD-Developed Supplemental Compiler Options ..... 3-4
    3.2 Running the Program ..... 3-5
        3.2.1 Running Code on Windows ..... 3-6
        3.2.2 Running Code on Linux ..... 3-7
    3.3 Calling Conventions ..... 3-7

Chapter 4  Debugging OpenCL
    4.1 AMD CodeXL GPU Debugger ..... 4-1
    4.2 Debugging CPU Kernels with GDB ..... 4-2
        4.2.1 Setting the Environment ..... 4-2
        4.2.2 Setting the Breakpoint in an OpenCL Kernel ..... 4-2
        4.2.3 Sample GDB Session ..... 4-3
        4.2.4 Notes ..... 4-4

Chapter 5  OpenCL Static C++ Programming Language
    5.1 Overview ..... 5-1
        5.1.1 Supported Features ..... 5-1
        5.1.2 Unsupported Features ..... 5-2
        5.1.3 Relations with ISO/IEC C++ ..... 5-2
    5.2 Additions and Changes to Section 5 - The OpenCL C Runtime ..... 5-2
        5.2.1 Additions and Changes
Chapter 1  OpenCL Architecture and AMD Accelerated Parallel Processing

This chapter provides a general software and hardware overview of the AMD Accelerated Parallel Processing implementation of the OpenCL standard. It explains the memory structure and gives simple programming examples.

1.1 Terminology

Term: compute kernel
Description: To define a compute kernel, it is first necessary to define a kernel. A kernel is a small unit of execution that performs a clearly defined function and that can be executed in parallel. Such a kernel can be executed on each element of an input stream (called an NDRange), or simply at each point in an arbitrary index space. A kernel is analogous to, and on some devices identical to, what graphics programmers call a shader program. This kernel is not to be confused with an OS kernel, which controls hardware. The most basic form of an NDRange is simply mapped over input data and produces one output item for each input tuple. Subsequent extensions of the basic model provide random-access functionality, variable output counts, and reduction/accumulation operations. Kernels are specified using the kernel keyword.

A compute kernel is a specific type of kernel that is not part of the traditional graphics pipeline. The compute kernel type can be used for graphics, but its strength lies in using it for non-graphics tasks.
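A minimal kernel of the kind described above, mapped over an NDRange to produce one output per input element, can be held as a host-side source string. The kernel name and arguments here are illustrative, not from the guide:

```c
#include <string.h>
#include <assert.h>

/* A minimal OpenCL C compute kernel, as a host-side source string:
 * each work-item reads one input element and writes one output element,
 * using its global index into the NDRange. */
const char *square_kernel_src =
    "__kernel void square(__global const float *in,\n"
    "                     __global float *out)\n"
    "{\n"
    "    size_t i = get_global_id(0);\n"
    "    out[i] = in[i] * in[i];\n"
    "}\n";
```

Such a string would be handed to clCreateProgramWithSource and built before the kernel handle is created, as the example programs later in this chapter show.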
...work-groups.
- Implicitly: the developer specifies the total number of work-items to execute in parallel, and OpenCL manages the division into work-groups.

OpenCL's API also supports the concept of a task dispatch. This is equivalent to executing a kernel on a compute device with a work-group and NDRange containing a single work-item. Parallelism is expressed using vector data types implemented by the device, enqueuing multiple tasks, and/or enqueuing native kernels developed using a programming model orthogonal to OpenCL.

1.3 Programming Model

The OpenCL programming model is based on the notion of a host device, supported by an application API, and a number of devices connected through a bus. These are programmed using OpenCL C. The host API is divided into platform and runtime layers. OpenCL C is a C-like language with extensions for parallel programming, such as memory fence operations and barriers. Figure 1.1 illustrates this model with queues of commands, reading/writing data, and executing kernels for specific devices.

Figure 1.1 OpenCL Programming Model

The devices are capable of running data- and task-parallel work.
...experimental features provided by AMD Accelerated Parallel Processing: the GNU project debugger (GDB) to debug kernels on x86 CPUs running Linux, or cygwin/minGW under Windows.

4.1 AMD CodeXL GPU Debugger

CodeXL 1.2 is available as an extension to Microsoft Visual Studio, a stand-alone version for Windows, and a stand-alone version for Linux. It offers real-time OpenCL kernel debugging and memory analysis on GPU devices, and lets developers:
- access the kernel execution directly from the API call that issues it,
- debug inside the kernel, and
- view all variable values across the different work-groups and work-items.

It also provides OpenGL API-level debugging and memory analysis. For information on downloading and installing CodeXL, see:

http://developer.amd.com/tools-and-sdks/heterogeneous-computing/codexl/

After installing CodeXL, launch Visual Studio, then open the solution to be worked on. In the Visual Studio menu bar, note the new CodeXL menu, which contains all the required controls.

Select a Visual C/C++ project, and set its debugging properties as normal. To add a breakpoint, either select New CodeXL Breakpoint from the CodeXL menu, or navigate to a kernel file (.cl) used in the application, and set a breakpoint on the appropriate source line. Then, select Launch OpenCL/OpenGL Debugging from the CodeXL menu to start debugging.

CodeXL currently supports only API-level debugging and OpenCL kernel debugging. Stepping through C/C++ code
...processing power of GPUs for high-performance, data-parallel computing in a wide range of applications. The AMD Accelerated Parallel Processing system includes a software stack, AMD GPUs, and AMD multicore CPUs. Figure 2.1 illustrates the relationship of the AMD Accelerated Parallel Processing components.

Figure 2.1 AMD Accelerated Parallel Processing Software Ecosystem (compute applications and third-party tools layered over the OpenCL runtime, running on multicore AMD CPUs and AMD GPUs)

The AMD Accelerated Parallel Processing software stack provides end-users and developers with a complete, flexible suite of tools to leverage the processing power in AMD GPUs. AMD Accelerated Parallel Processing software embraces open-systems, open-platform standards. The AMD Accelerated Parallel Processing open platform strategy enables AMD technology partners to develop and provide third-party development tools.

The software includes the following components:
- OpenCL compiler and runtime
- Debugging and Performance Profiling Tools - AMD CodeXL
- Performance Libraries - AMD Accelerated Parallel Processing Math Library (APPML) for optimized NDRange-specific algorithms

The latest generations of AMD GPUs use unified shader architectures capable of running different kernel types.
...using the current working directory of the application.

-D name
    Predefine name as a macro, with definition 1. For -D name=definition, the contents of definition are tokenized and processed as if they appeared during translation phase three in a #define directive. In particular, the definition is truncated by embedded newline characters. -D options are processed in the order they are given in the options argument to clBuildProgram.

3.1.4 AMD-Developed Supplemental Compiler Options

The following supported options are not part of the OpenCL specification:

-g
    This is an experimental feature that lets you use the GNU project debugger, GDB, to debug kernels on x86 CPUs running Linux or cygwin/minGW under Windows. For more details, see Chapter 4, "Debugging OpenCL." This option does not affect the default optimization of the OpenCL code.

-O0
    Specifies to the compiler not to optimize. This is equivalent to the OpenCL standard option -cl-opt-disable.

-fno-bin-source
    Does not generate OpenCL source in the .source section. For more information, see Appendix C, "OpenCL Binary Image Format (BIF) v2.0."

-fno-bin-llvmir
    Does not generate LLVM IR in the .llvmir section. For more information, see Appendix C, "OpenCL Binary Image Format (BIF) v2.0."

-fno-bin-amdil
    Does not generate AMD IL in the .amdil section. For more information, see Appendix C, "OpenCL Binary Image Format (BIF) v2.0."

-fno-bin-
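The options above are passed to clBuildProgram as a single space-separated string. The sketch below composes such a string in host C; the WIDTH/AREA macro names are illustrative, and the in-order -D processing described above is what lets AREA refer to the earlier WIDTH definition:

```c
#include <stdio.h>
#include <string.h>
#include <assert.h>

/* Compose a clBuildProgram options string from the flags described above.
 * -D definitions are processed in the order given, so a later definition
 * may build on an earlier one. Macro names here are hypothetical. */
int build_options(char *dst, size_t n, int debug, int no_opt)
{
    return snprintf(dst, n, "-D WIDTH=256 -D AREA=(WIDTH*WIDTH)%s%s",
                    debug  ? " -g"  : "",
                    no_opt ? " -O0" : "");
}
```

The resulting string would be the fourth argument of clBuildProgram, e.g. clBuildProgram(program, 1, &device, buf, NULL, NULL).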
...into contiguous buffers, one per thread. Each CPU thread serially walks through its buffer portion, which results in good cache and prefetch behavior for each core.

On the GPU, each thread walks the source buffer using a stride of the total number of threads. As many threads are executed in parallel, the result is a maximally coalesced memory pattern requested from the memory back-end. For example, if each compute unit has 16 physical processors, 16 uint4 requests are produced in parallel, per clock, for a total of 256 bytes per clock.

The kernel code uses a reduction consisting of three stages: global to private, private to local (which is flushed to global), and finally global to global. In the first loop, each thread walks global memory and reduces all values into a min value in private memory (typically, a register). This is the bulk of the work, and is mainly bound by global memory bandwidth. The subsequent reduction stages are brief in comparison.

Next, all per-thread minimum values inside the work-group are reduced to a local value, using an atomic operation. Access to the local value is serialized; however, the number of these operations is very small compared to the work of the previous reduction stage.
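The first (global-to-private) stage of the reduction described above can be modeled serially in C, with one function call playing the role of one GPU work-item walking the source at a stride equal to the total number of work-items. This is a sketch of the access pattern only, not the guide's kernel:

```c
#include <limits.h>
#include <assert.h>

/* Serial model of the GPU global-to-private reduction stage: work-item
 * `global_id` visits src[global_id], src[global_id + global_size], ...,
 * so consecutive work-items touch consecutive addresses each step,
 * which is what makes the real GPU accesses coalesce. */
unsigned private_min(const unsigned *src, unsigned n,
                     unsigned global_id, unsigned global_size)
{
    unsigned pmin = UINT_MAX;
    for (unsigned i = global_id; i < n; i += global_size) {
        if (src[i] < pmin) pmin = src[i];
    }
    return pmin;
}
```

The minimum over all work-items' private results is the value the later local- and global-reduction stages combine.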
...with the GL context.
b. CL_DEVICES_FOR_GL_CONTEXT_KHR includes all GL-context interoperable devices.

While it is possible to create as many GL contexts on a GPU, do not concurrently create two GL contexts for two GPUs from the same process.

For OpenGL interoperability with OpenCL, there is a strict order in which the OpenCL context is created and the texture/buffer shared allocations can be made. To use shared resources, the OpenGL application must create an OpenGL context and, afterwards, an OpenCL context. All resources (GL buffers and textures) created after the OpenCL context was created can be shared between OpenGL and OpenCL. If resources are allocated before the OpenCL context was created, they cannot be shared between OpenGL and OpenCL.

E.1 Under Windows

This section discusses CL-GL interoperability for single and multiple GPU systems running under Windows.

Single GPU Environment

Creating CL Context from a GL Context

Use the GLUT windowing system or the Win32 API for event handling.

Using GLUT

1. Use glutInit to initialize the GLUT library and negotiate a session with the windowing system. This function also processes the command-line options, depending on the windowing system.

2. Use wglGetCurrentContext to get the current OpenGL rendering context (HGLRC) of the calling thread.
...units execute 64 work-items. The Asynchronous Compute Engines (ACEs) manage the CUs; a graphics command processor handles graphics shaders and fixed-function hardware.

In pre-GCN devices (for a hardware overview of pre-GCN devices, see Appendix D, "Hardware overview of pre-GCN devices"), each compute unit consists of a single vector unit, each containing up to 16 processing elements. Each processing element, which contains 4 or 5 ALUs, could execute bundles of 4 or 5 independent instructions co-issued in a VLIW (Very Long Instruction Word) format. All the processing elements within a vector unit execute a single wavefront (a group of 64 work-items). If operations within a wavefront contain dependencies, they cannot be scheduled in the same clock cycle, leaving some ALUs un-utilized. In such cases, some processing elements (and, hence, vector units) remain under-utilized.

In GCN devices, the CUs are arranged in four vector unit arrays, consisting of 16 processing elements each. Each of these arrays executes a single instruction across each lane for each block of 16 work-items. That instruction is repeated over four cycles to make the 64-element vector called a wavefront. Thus, in GCN devices, the four vector units within a CU can operate on four different wavefronts. If operations within a wavefront include dependencies, independent operations from different wavefronts can be selected to be assigned to a single vector unit.
...synchronization points are the mapping or un-mapping of the SVM memory and kernel launch or completion. This means that any updates are visible only at the end of the kernel or at the point of un-mapping the region of memory. Coarse-grained buffer memory has a fixed virtual address for all the devices it is allocated on. The physical memory is allocated on Device Memory.

- For fine-grained SVM, the synchronization points include, in addition to those defined for coarse-grained SVM, the mapping or un-mapping of memory and atomic operations. This means that updates are visible at the level of atomic operations on the SVM buffer (for fine-grained buffer SVM, allocated with the CL_MEM_SVM_ATOMICS flag) or the SVM system, i.e. anywhere in the SVM (for fine-grained system SVM). Fine-grained buffer memory has the same virtual address for all the devices it is allocated on. The physical memory is allocated on the Device-Visible Host Memory. If the fine-grain buffer is allocated with the CL_MEM_SVM_ATOMICS flag, the memory will be GPU-CPU coherent.

The OpenCL 2.0 specification mandates coarse-grained SVM, but not fine-grained SVM. For details, the developer is urged to read Section 3.3 of the OpenCL 2.0 specification.

In OpenCL 2.0, SVM buffers shared between the host and OpenCL devices are created by calling clSVMAlloc (or malloc/new, in the case of fine-grain system support). The contents of such buffers may include pointers (into SVM buffers). Pointer-based data structures
...work-group size is known to, or required to, divide evenly into the global size.

6.11 Portability considerations

6.11.1 Migrating from OpenCL 1.2 to OpenCL 2.0

OpenCL 2.0 is backward compatible with OpenCL 1.2. Applications written on OpenCL 1.2 should run on OpenCL 2.0 without requiring any changes to the application.

OpenCL 2.0 includes changes in the runtime and the compiler. In the runtime, some new functions (such as those for SVM) have been added. In the compiler, the -cl-std=CL2.0 option is needed in order to compile OpenCL 2.0 kernels.

If a program uses the OpenCL 2.0 functions and one compiles a kernel by using the -cl-std=CL2.0 option, the program will not build or compile on OpenCL 1.2 platforms. If a program uses only OpenCL 1.2 functions and one compiles a kernel without the -cl-std=CL2.0 option, then the program should run on OpenCL 2.0 platforms.

6.11.2 Identifying implementation specifics

Applications can query for the OpenCL extensions and use the values returned from the OpenCL functions. For instance, clGetSupportedImageFormats will return all image formats supported by OpenCL. The supported images may differ across implementations. Similarly, clGetDeviceInfo with the CL_DEVICE_EXTENSIONS parameter returns all the supported extensions. The supported extensions may differ across implementations.
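The CL_DEVICE_EXTENSIONS query mentioned above returns a single space-separated string, so checking for a specific extension needs more care than a plain substring search (one extension name can be a prefix of another, e.g. cl_amd_media_ops and cl_amd_media_ops2). A host-side sketch of such a check (the helper name is illustrative):

```c
#include <string.h>
#include <assert.h>

/* Check a space-separated extension string of the kind returned by
 * clGetDeviceInfo(..., CL_DEVICE_EXTENSIONS, ...). A bare strstr() is not
 * enough, because one extension name can be a prefix of another. */
int has_extension(const char *ext_string, const char *name)
{
    size_t len = strlen(name);
    const char *p = ext_string;
    while ((p = strstr(p, name)) != NULL) {
        int starts = (p == ext_string) || (p[-1] == ' ');
        int ends   = (p[len] == '\0') || (p[len] == ' ');
        if (starts && ends) return 1;  /* whole-token match */
        p += len;
    }
    return 0;
}
```

Gating optional features on such a check is the portable way to handle the implementation differences this section describes.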
    minp   = clCreateKernel( program, "minp",   NULL );
    reduce = clCreateKernel( program, "reduce", NULL );

    // Create input, output and debug buffers.
    src_buf = clCreateBuffer( context,
                              CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                              num_src_items * sizeof(cl_uint),
                              src_ptr,
                              NULL );
    dst_buf = clCreateBuffer( context,
                              CL_MEM_READ_WRITE,
                              num_groups * sizeof(cl_uint),
                              NULL, NULL );
    dbg_buf = clCreateBuffer( context,
                              CL_MEM_WRITE_ONLY,
                              global_work_size * sizeof(cl_uint),
                              NULL, NULL );

    clSetKernelArg( minp, 0, sizeof(void *),      (void *) &src_buf );
    clSetKernelArg( minp, 1, sizeof(void *),      (void *) &dst_buf );
    clSetKernelArg( minp, 2, 1 * sizeof(cl_uint), (void *) NULL );
    clSetKernelArg( minp, 3, sizeof(void *),      (void *) &dbg_buf );
    clSetKernelArg( minp, 4, sizeof(int),         (void *) &num_src_items );
    clSetKernelArg( minp, 5, sizeof(int),         (void *) &dev );

    clSetKernelArg( reduce, 0, sizeof(void *),    (void *) &src_buf );
    clSetKernelArg( reduce, 1, sizeof(void *),    (void *) &dst_buf );

    CPerfCounter t;
    t.Reset();
    t.Start();

    // 6. Main timing loop.
    #define NLOOPS 500

    cl_event ev;
    int nloops = NLOOPS;

    while( nloops-- )
    {
        clEnqueueNDRangeKernel( queue, minp, 1, NULL,
                                &global_work_size, &local_work_size,
                                0, NULL, &ev );
        clEnqueueNDRangeKernel( queue, reduce, 1, NULL,
                                &num_groups, NULL, 1, &ev, NULL );
    }
    clFinish( queue );
    t.Stop();
Table A.2 Extension Support for Older AMD GPUs and CPUs

Extension                       Juniper(1)  Redwood(2)  Cedar(3)  x86 CPU with SSE2 or later
cl_khr_*_atomics                Yes         Yes         Yes       Yes
cl_ext_atomic_counters_32       Yes         Yes         Yes       No
cl_khr_gl_sharing               Yes         Yes         Yes       Yes
cl_khr_byte_addressable_store   Yes         Yes         Yes       Yes
cl_ext_device_fission           No          No          No        Yes
cl_amd_device_attribute_query   Yes         Yes         Yes       Yes
cl_khr_fp64                     No          No          No        Yes
cl_amd_fp64(4)                  No          No          No        Yes
cl_amd_vec3                     Yes         Yes         Yes       Yes
Images(5)                       Yes         Yes         Yes       Yes
cl_khr_d3d10_sharing            Yes         Yes         Yes       Yes
cl_amd_media_ops                Yes         Yes         Yes       Yes
cl_amd_media_ops2               Yes         Yes         Yes       Yes
cl_amd_printf                   Yes         Yes         Yes       Yes
cl_amd_popcnt                   Yes         Yes         Yes       Yes
cl_khr_3d_image_writes          Yes         Yes         Yes       No
Platform Extensions
cl_khr_icd                      Yes         Yes         Yes       Yes
cl_amd_event_callback           Yes         Yes         Yes       Yes
cl_amd_offline_devices          Yes         Yes         Yes       Yes

1. ATI Radeon HD 5700 series, AMD Mobility Radeon HD 5800 series, AMD FirePro V5800 series, AMD Mobility FirePro M7820.
2. ATI Radeon HD 5600 series, ATI Radeon HD 5500 series, AMD Mobility Radeon HD 5700 series, AMD Mobility Radeon HD 5600 series, AMD FirePro V4800 series, AMD FirePro V3800 series, AMD Mobility FirePro M5800.
3. ATI Radeon HD 5400 series, AMD Mobility Radeon HD 5400 series.
4. Available on all devices that have double-precision, including all Southern Islands devices.
5. Environment variable CPU_IMAGE_SUPPORT must be set.
A kernel can be executed as a function of multi-dimensional domains of indices. Each element is called a work-item; the total number of indices is defined as the global work-size. The global work-size can be divided into sub-domains, called work-groups, and individual work-items within a group can communicate through global or locally shared memory. Work-items are synchronized through barrier or fence operations.

Figure 1.1 is a representation of the host/device architecture with a single platform, consisting of a GPU and a CPU. An OpenCL application is built by first querying the runtime to determine which platforms are present. There can be any number of different OpenCL implementations installed on a single system. The desired OpenCL platform can be selected by matching the platform vendor string to the desired vendor name, such as "Advanced Micro Devices, Inc." The next step is to create a context. As shown in Figure 1.1, an OpenCL context has associated with it a number of compute devices (for example, CPU or GPU devices). Within a context, OpenCL guarantees a relaxed consistency between these devices. This means that memory objects, such as buffers or images, are allocated per context; but changes made by one device are only guaranteed to be visible by another device at well-defined synchronization points. For this, OpenCL provides events, with the ability to synchronize on a given event to enforce the correct order of execution.
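The vendor-string matching step described above reduces to a simple comparison once the vendor strings have been fetched (in a real program, via clGetPlatformIDs and clGetPlatformInfo with CL_PLATFORM_VENDOR). A host-side sketch of just the selection logic, with the function name being illustrative:

```c
#include <string.h>
#include <assert.h>

/* Pick the index of the platform whose vendor string matches `desired`
 * exactly, or return -1 if no installed platform matches. */
int select_platform(const char *vendors[], int num_platforms,
                    const char *desired)
{
    for (int i = 0; i < num_platforms; ++i) {
        if (strcmp(vendors[i], desired) == 0)
            return i;
    }
    return -1;
}
```

The matched index would then identify which cl_platform_id to use when creating the context.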
...colormap of the specified visual type for the screen on which the specified window resides, and returns the colormap ID associated with it. Note that the specified window is only used to determine the screen.

Use XCreateWindow to create an unmapped sub-window for a specified parent window; it returns the window ID of the created window, and causes the X server to generate a CreateNotify event. The created window is placed on top in the stacking order with respect to siblings.

Use XMapWindow to map the window and all of its sub-windows that have had map requests. Mapping a window that has an unmapped ancestor does not display the window, but marks it as eligible for display when the ancestor becomes mapped. Such a window is called unviewable. When all its ancestors are mapped, the window becomes viewable and is visible on the screen, if it is not obscured by another window.

Use the glXCreateContextAttribsARB function to initialize the context to the initial state defined by the OpenGL specification, and return a handle to it. This handle can be used to render to any GLX surface.

Use glXMakeCurrent to make argument 3 (GLXContext) the current GLX rendering context of the calling thread, replacing the previously current context, if there was one, and to attach argument 3 (GLXContext) to a GLX drawable, either a window or a GLX pixmap.

Use clGetGLContextInfoKHR to get the number of OpenCL-OpenGL interoperability devices corresponding to the window created in
Figure 1.2 Interrelationship of Memory Domains for Southern Islands Devices (compute unit processing elements with private memory/register files, local memory (LDS), DMA engines, and compute device memory (VRAM) holding global and constant memory)

Figure 1.3 illustrates the standard dataflow between host (CPU) and GPU.

Figure 1.3 Dataflow between Host and GPU

There are two ways to copy data from the host to the GPU compute device memory:
- Implicitly, by using clEnqueueMapBuffer and clEnqueueUnmapMemObject.
- Explicitly, through clEnqueueReadBuffer and clEnqueueWriteBuffer (clEnqueueReadImage, clEnqueueWriteImage).

When using these interfaces, it is important to consider the amount of copying involved. There is a two-copy process: between host and PCIe, and between PCIe and GPU compute device. With proper memory transfer management and the use of system-pinned memory (host/CPU memory remapped to the PCIe memory space), copying between host (CPU) memory and PCIe memory can be skipped. Double copying lowers the overall system memory bandwidth. In GPU compute device programming, pipelining and other techniques help reduce these bottlenecks. See the AMD OpenCL Optimization Reference Guide.
        global_work_size = compute_units * 1;   // 1 thread per core
        local_work_size  = 1;
    }
    else
    {
        cl_uint ws = 64;

        global_work_size = compute_units * 7 * ws;   // 7 wavefronts per SIMD
        while( (num_src_items / 4) % global_work_size != 0 )
            global_work_size += ws;

        local_work_size = ws;
    }

    num_groups = global_work_size / local_work_size;

    // Create a context and command queue on that device.
    context = clCreateContext( NULL, 1, &device, NULL, NULL, NULL );
    queue   = clCreateCommandQueue( context, device, 0, NULL );

    // Minimal error check.
    if( queue == NULL )
    {
        printf("Compute device setup failed\n");
        return(-1);
    }

    // Perform runtime source compilation, and obtain kernel entry point.
    program = clCreateProgramWithSource( context, 1,
                                         &kernel_source, NULL, NULL );

    // Tell compiler to dump intermediate .il and .isa GPU files.
    ret = clBuildProgram( program, 1, &device, "-save-temps", NULL, NULL );

    // 5. Print compiler error messages.
    if( ret != CL_SUCCESS )
    {
        printf("clBuildProgram failed: %d\n", ret);

        char buf[0x10000];
        clGetProgramBuildInfo( program, device, CL_PROGRAM_BUILD_LOG,
                               0x10000, buf, NULL );
        printf("\n%s\n", buf);
        return(-1);
    }
The threads within a work-group are synchronized through a local barrier. The reduced min() value is stored in global memory.

After all work-groups are finished, a second kernel reduces all work-group values into a single value in global memory, using an atomic operation. This is a minor contributor to the overall runtime.

Example Code 3

//
// Copyright (c) 2010 Advanced Micro Devices, Inc. All rights reserved.
//

#include <CL/cl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include "Timer.h"

#define NDEVS 2

// A parallel min() kernel that works well on CPU and GPU.

const char *kernel_source =
"                                                                      \n"
"#pragma OPENCL EXTENSION cl_khr_local_int32_extended_atomics : enable \n"
"#pragma OPENCL EXTENSION cl_khr_global_int32_extended_atomics : enable\n"
"                                                                      \n"
"    // 9. The source buffer is accessed as 4-vectors.                 \n"
"                                                                      \n"
"__kernel void minp( __global uint4 *src,                              \n"
"                    __global uint  *gmin,                             \n"
"                    __local  uint  *lmin,                             \n"
"                    __global uint  *dbg,                              \n"
"                    int             nitems,                           \n"
"                    uint            dev )                             \n"
"{                                                                     \n"
"    // 10. Set up __global memory access pattern.                     \n"
"                                                                      \n"
"    uint count  = ( nitems / 4 ) / get_global_size(0);                \n"
"    uint idx    = (dev == 0) ? get_global_id(0) * count               \n"
"                             : get_global_id(0);                      \n"
"    uint stride = (dev == 0) ? 1 : get_global_size(0);                \n"
"    uint pmin   = (uint) -1;                                          \n"
"                                                                      \n"
"    // 11. First, compute private min, for this work-item.            \n"
...the implementation is not required to detect name collision with the user-specified kernel mangled names involved.

5.3.6 Exceptions

Exceptions, as per Section 15 of the C++ language specification, are not supported. The keywords try, catch, and throw are reserved, and the OpenCL C++ compiler must produce a static compile-time error if they are used in the input program.

5.3.7 Libraries

Support for the general utilities library, as defined in Sections 20-21 of the C++ language specification, is not provided. The standard C++ libraries and STL library are not supported.

5.3.8 Dynamic Operation

Features related to dynamic operation are not supported:
- the virtual modifier. OpenCL C++ prohibits the use of the virtual modifier. Thus, virtual member functions and virtual inheritance are not supported.
- Dynamic cast that requires runtime check.
- Dynamic storage allocation and deallocation.

5.4 Examples

5.4.1 Passing a Class from the Host to the Device and Back

The class definition must be the same on the host code and the device code, besides the members' type in the case of vectors. If the class includes vector data types, the definition must conform to the table that appears in Section 6.1.2 of the OpenCL Programming Specification 1.2, Corresponding API type for OpenCL Language types.

Example Kernel Code:

class Test
{
public:
    void setX(int value);
private:
    int x;
};

__kernel void foo(__global Test* InClass)
{
    if (get_global_id(0) == 0)
        InClass->setX
OpenCL 2.0. The AMD implementations support this combination. The remaining combinations are optional in OpenCL 2.0.

When not using the mandatory combination (CL_sRGBA, CL_UNORM_INT8), the clGetSupportedImageFormats function must be used to get a list of supported image formats and data types before using the sRGB image.

Creating sRGB image objects is similar to creating an image object of an existing supported channel order with OpenCL 2.0. The following snippet shows how to create CL_sRGBA image objects by using the clCreateImage call.

6.9 Image Enhancements 6-25
Copyright 2013 Advanced Micro Devices, Inc. All rights reserved.

AMD ACCELERATED PARALLEL PROCESSING

    cl_image_format imageFormat;
    imageFormat.image_channel_data_type = CL_UNORM_INT8;
    imageFormat.image_channel_order    = CL_sRGBA;

    cl_mem imageObj = clCreateImage(
        context,                                  // A valid OpenCL context
        CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
        &imageFormat,
        &desc,                                    // cl_image_desc
        pSrcImage,                                // A pointer to the image data
        &retErr);                                 // Returned error code

A new sRGB image can also be created based on an existing RGB image object, so that the kernel can implicitly convert the sRGB image data to RGB. This is useful when the viewing pixels are sRGB, but share the same data as the existing RGB image.

After an sRGB image object has been created, the read_imagef call can be used in the kernel to read it transparently; read_imagef implicitly converts the sRGB image data to linear RGB.
OpenCL intermediate representation (LLVM IR);
- .comment, for storing the OpenCL version and the driver version that created the binary.

The BIF can have other special sections for debugging, etc. It also contains several ELF special sections, such as:
- .text, for storing the executable;
- .rodata, for storing the OpenCL runtime control data;
- other ELF special sections required for forming an ELF (for example: .strtab, .symtab, .shstrtab).

By default, OpenCL generates a binary that has LLVM IR and the executable for the GPU (.llvmir, .amdil, and .text sections), as well as LLVM IR and the executable for the CPU (.llvmir and .text sections). The BIF binary always contains a .comment section, which is a readable C string. The default behavior can be changed with the BIF options described in Section C.2, "BIF Options," page C-3.

The LLVM IR enables recompilation from LLVM IR to the target. When a binary is used to run on a device for which the original program was not generated, and the original device is feature-compatible with the current device, OpenCL recompiles the LLVM IR to generate new code for the device. Note that the LLVM IR is only universal within devices that are feature-compatible in the same device type, not across different device types. This means that the LLVM IR for the CPU is not compatible with the LLVM IR for the GPU. The LLVM IR for a GPU works only for GPU devices that have equivalent feature sets.

BIF 2.0 is sup-
fences. These additions allow synchronization between work-items in different work-groups, as well as fine-grained synchronization with the host, using atomic operations on memory in fine-grained SVM buffers allocated with the CL_MEM_SVM_ATOMICS flag.

The following examples illustrating the use of atomics are part of the AMD APP SDK.

Chapter 6: OpenCL 2.0
Copyright 2013 Advanced Micro Devices, Inc. All rights reserved.

AMD ACCELERATED PARALLEL PROCESSING

6.5.2.1 Atomic Loads/Stores
This sample illustrates atomic loads/stores with the use of memory orders. The first step is to create this memory on the host:

    buffer = (int *) clSVMAlloc(context, CL_MEM_SVM_FINE_GRAIN_BUFFER,
                                (N + 1) * sizeof(int), 4);
    atomicBuffer = (int *) clSVMAlloc(context,
                                CL_MEM_SVM_FINE_GRAIN_BUFFER | CL_MEM_SVM_ATOMICS,
                                (N + 1) * sizeof(int), 4);

Note the flags sent as parameters: CL_MEM_SVM_FINE_GRAIN_BUFFER and CL_MEM_SVM_ATOMICS.

The following kernel runs on all work-items in parallel. It atomically loads atomicBuffer[0], checks whether its value is 99, and waits until it is 99. The acquire memory order is used to indicate that the latest update must be seen (made on the host) and to ensure that the local L1 cache is not read from. The value is made 99 by the host CPU by std::atomic_store_explicit:

    std::atomic_store_explicit((std::atomic<int> *) &atomicBuffer[0], 99,
                               std::memory_order_release);
Built-in Function:
    ulongn amd_qsad(ulongn src0, uintn src1, ulongn src2)
Description:
    uchar8  src0u8  = as_uchar8(src0.s0);
    ushort4 src2u16 = as_ushort4(src2.s0);
    ushort4 dstu16;
    dstu16.s0 = amd_sad(as_uint(src0u8.s0123), src1.s0, src2u16.s0);
    dstu16.s1 = amd_sad(as_uint(src0u8.s1234), src1.s0, src2u16.s1);
    dstu16.s2 = amd_sad(as_uint(src0u8.s2345), src1.s0, src2u16.s2);
    dstu16.s3 = amd_sad(as_uint(src0u8.s3456), src1.s0, src2u16.s3);
    dst.s0 = as_uint2(dstu16);
    A similar operation is applied to other components of the vectors.

A.8 AMD Vendor-Specific Extensions A-9
Copyright 2013 Advanced Micro Devices, Inc. All rights reserved.

Built-in Function:
    ulongn amd_mqsad(ulongn src0, uintn src1, ulongn src2)
Description:
    uchar8  src0u8  = as_uchar8(src0.s0);
    ushort4 src2u16 = as_ushort4(src2.s0);
    ushort4 dstu16;
    dstu16.s0 = amd_msad(as_uint(src0u8.s0123), src1.s0, src2u16.s0);
    dstu16.s1 = amd_msad(as_uint(src0u8.s1234), src1.s0, src2u16.s1);
    dstu16.s2 = amd_msad(as_uint(src0u8.s2345), src1.s0, src2u16.s2);
    dstu16.s3 = amd_msad(as_uint(src0u8.s3456), src1.s0, src2u16.s3);
    dst.s0 = as_uint2(dstu16);
    A similar operation is applied to other components of the vectors.

Built-in Function:
    uintn amd_sadw(uintn src0, uintn src1, uintn src2)
Description:
    ushort2 src0u16 = as_ushort2(src0.s0);
    ushort2 src1u16 = as_ushort2(src1.s0);
    dst.s0 = src2.s0 +
             abs(src0u16.s0 - src1u16.s0) +
             abs(src0u16.s1 - src1u16.s1);
158. ndex 4 f no bin IIvmir 4 f no bin source 4 D paia d debeo ER qd 4 900 issus see ede x asc dus 4 save temps 4 compiling an OpenCL application 2 C files 3 kernels AA 1 on Linux building 32 bit object files on a 64 bit sys LOT CR ch a pese m Bieta pace DW WS 3 linking to a 32 bit library 3 linking to a 64 bit library 3 OpenCL on Linux 3 OpenCL on Windows 2 Intel C C compiler 2 setting project properties 2 Visual Studio 2008 Professional Edition 2 the host program 2 computation data parallel model 2 compute device structure GPU ges ae a hadley diamant 6 2 compute kernel data parallel granularity 2 definition isse REOR ates 1 strengths computationally intensive applications 1 wavefronts 2 workgroups 2 compute unit MAPPING sac seven sav a a dau 2 stream cores 3 compute unites number in AMD GPU 8 compute units 290X devices 6 independent operation 4 number in AMD GPUs 8 structured in AMD GPUS 8 constants caching 2 nem ahahha de Khan 10 command queue elements 9 constraints of the c
ne:
    2002 = CPU generic without SSE3
    2003 = CPU generic with SSE3

    typedef enum CALtargetEnum {
        CAL_TARGET_600        = 0,   /**< R600 GPU ISA        */
        CAL_TARGET_610        = 1,   /**< RV610 GPU ISA       */
        CAL_TARGET_630        = 2,   /**< RV630 GPU ISA       */
        CAL_TARGET_670        = 3,   /**< RV670 GPU ISA       */
        CAL_TARGET_7XX        = 4,   /**< R700-class GPU ISA  */
        CAL_TARGET_770        = 5,   /**< RV770 GPU ISA       */
        CAL_TARGET_710        = 6,   /**< RV710 GPU ISA       */
        CAL_TARGET_730        = 7,   /**< RV730 GPU ISA       */
        CAL_TARGET_CYPRESS    = 8,   /**< CYPRESS GPU ISA     */
        CAL_TARGET_JUNIPER    = 9,   /**< JUNIPER GPU ISA     */
        CAL_TARGET_REDWOOD    = 10,  /**< REDWOOD GPU ISA     */
        CAL_TARGET_CEDAR      = 11,  /**< CEDAR GPU ISA       */
        CAL_TARGET_SUMO       = 12,  /**< SUMO GPU ISA        */
        CAL_TARGET_SUPERSUMO  = 13,  /**< SUPERSUMO GPU ISA   */
        CAL_TARGET_WRESTLER   = 14,  /**< WRESTLER GPU ISA    */
        CAL_TARGET_CAYMAN     = 15,  /**< CAYMAN GPU ISA      */
        CAL_TARGET_KAUAI      = 16,  /**< KAUAI GPU ISA       */
        CAL_TARGET_BARTS      = 17,  /**< BARTS GPU ISA       */
        CAL_TARGET_TURKS      = 18,  /**< TURKS GPU ISA       */
        CAL_TARGET_CAICOS     = 19,  /**< CAICOS GPU ISA      */
        CAL_TARGET_TAHITI     = 20,  /**< TAHITI GPU ISA      */
        CAL_TARGET_PITCAIRN   = 21,  /**< PITCAIRN GPU ISA    */
        CAL_TARGET_CAPEVERDE  = 22,  /**< CAPE VERDE GPU ISA  */
        CAL_TARGET_DEVASTATOR = 23,  /**< DEVASTATOR GPU ISA  */
        CAL_TARGET_SCRAPPER   = 24,  /**< SCRAPPER GPU ISA    */
        CAL_TARGET_OLAND      = 25,  /**< OLAND GPU ISA       */
        CAL_TARGET_BONAIRE    = 26,
160. nel instructions 9 overvieW 8 skip copying between host memory and PCle memory aia aes dees KGG Fi throughput 8 performance work groups 2 pipelining 7 platform vendor string remains constant for a particular vendor s implementation 3 searching for desired OpenCL platform 3 vs platform name string 3 point barrier in the code 2 population count extension 7 post processing incomplete AMD IL 1 Index 12 pre ICD code snippet 1 2 processing by command processors 7 processing elements SIMD arrays 8 program examples eee 10 simple buffer write 10 programming basic steps with minimum code 10 GPU amba ra Rated das edad 2 techniques simple tests parallel min function 19 programming model AMD Accelerated Parallel Processing 2 OpenCL 3 executing kernels for specific devices 3 queues of commands 3 reading writing data 3 project property settings compiling on Windows 2 Q querying AMD specific device attributes 5 extensions for a list of devices 2 for a platform 1 OpenCL eroras sata KA ha L
161. nel kernel cl Buffer bufX cl Buffer bu Y D I NT ET PRES RE RER The saxpy kerne NP AIIB IB HH A string kernelStr kernel void saxpy const global float x n global float y n i const float a n n uint gid get global id 0 n ylgid a x gid y gid n Ng nG ANAN ANAK AKA A KAKA TT BABABA AABANG BABA BAKA BANANA BABA I TT IT 17 Allocate and initialize memory on the host MIMMMMMMMIMMPM P MC PMOILLILLLPBG GMAMMMPMPMPMPMPMPMPMPMPMPMPMMMMMIIIIII LP Mall void initHost size t sizeInBytes length sizeof cl float pX cl float malloc sizeInBytes if pX NULL throw string Error Failed to allocate input memory on host n pY cl float malloc sizeInBytes if pY NULL throw string Error Failed to allocate input memory on host n for int i 0 i length i pX i pY i cl float i cl float length 1 i printVector X pX length printVector Y pY length HIM B MP Bg MIEIPLPMPMPMCPM OMlMMIPIIIIIIIIITPIPIMPBPGMMlAMlMllllllllllllllthd Release host memory HT PH HE HH HH HH LALA ALLA LALLA void cleanupHost if pX free pX pX NULL if pY NULL free pY pY NULL void main int argc char argv try VMUIMMMMMMIMIIIIMMPMLBMCBPMAIPH IOG PM LIMIIIIII I TL MIB IT I EM MP Mg M Allocate and initialize memory on the host 1 6 Exam
simultaneously. One work-group can request up to 32 kB of memory. Reads across the wavefront are dispatched over four cycles in waterfall.

The high bandwidth of the LDS memory is achieved not only through its proximity to the ALUs, but also through simultaneous access to its memory banks. Thus, it is possible to concurrently execute 32 write or read instructions, each nominally 32-bits; extended instructions (read2/write2) can be 64-bits each. If, however, more than one access attempt is made to the same bank at the same time, a bank conflict occurs. In this case, for indexed and atomic operations, hardware prevents the attempted concurrent accesses to the same bank by turning them into serial accesses. This decreases the effective bandwidth of the LDS. For maximum throughput (optimal efficiency), therefore, it is important to avoid bank conflicts. A knowledge of request scheduling and address mapping is key to achieving this.

Dataflow in Memory Hierarchy
Figure 1.5 is a conceptual diagram of the dataflow within the memory structure.

[Figure 1.5: Memory Hierarchy Dataflow — private memory per work-item; local (LDS) memory per work-group; images/texture (color buffer, depth buffer, texture cache); global write-only coherence cache (L2); global memory (VRAM).]

To load data into LDS from global memory, it is read from global memory and placed into the work-item's registers; then, a store is performed to LDS. Similarly, to store data into global memory,
using the modifier vn, where n can be 2, 3, 4, 8, or 16. This modifier appears before the original conversion specifier for the vector's component type (for example, to print a float4, use %v4f). Since vn is a conversion specifier, it is valid to apply optional flags, such as field width and precision, just as it is when printing the component types. Since a vector is an aggregate type, the comma separator is used between the components.

Appendix A: OpenCL Optional Extensions
Copyright 2013 Advanced Micro Devices, Inc. All rights reserved.

AMD ACCELERATED PARALLEL PROCESSING

A.8.12 cl_amd_predefined_macros
The following macros are predefined when compiling OpenCL C kernels. These macros are defined automatically based on the device for which the code is being compiled.

GPU devices:
    __WinterPark__, __BeaverCreek__, __Turks__, __Caicos__, __Tahiti__,
    __Pitcairn__, __Capeverde__, __Cayman__, __Barts__, __Cypress__,
    __Juniper__, __Redwood__, __Cedar__, __ATI_RV770__, __ATI_RV730__,
    __ATI_RV710__, __Loveland__, __GPU__

CPU devices:
    __CPU__, __X86__, __X86_64__

Note that __GPU__ or __CPU__ are predefined whenever a GPU or CPU device is the compilation target.

An example kernel is provided below.

    #pragma OPENCL EXTENSION cl_amd_printf : enable

    const char* getDeviceName() {
    #ifdef __Cayman__
        return "Cayman";
    #elif __Barts__
        return "Barts";
    #elif __Cypress__
        return "Cypress";
    #elif defined(__Juniper__)
        return "Juniper";
    #elif defined(
int write_pipe(pipe gentype p, const gentype *ptr)
    Write the packet specified by ptr to pipe p. Returns 0 if write_pipe is successful, and a negative value otherwise.

Pipe reservation functions:

reserve_id_t reserve_read_pipe(pipe gentype p, uint num_packets)
reserve_id_t reserve_write_pipe(pipe gentype p, uint num_packets)
reserve_id_t work_group_reserve_read_pipe(pipe gentype p, uint num_packets)
reserve_id_t work_group_reserve_write_pipe(pipe gentype p, uint num_packets)
reserve_id_t sub_group_reserve_read_pipe(pipe gentype p, uint num_packets)
reserve_id_t sub_group_reserve_write_pipe(pipe gentype p, uint num_packets)
    Reserve num_packets entries for reading from, or writing to, pipe p. Returns a valid reservation ID if the reservation is successful. The reserved pipe entries are referred to by indices that go from 0 to num_packets - 1.

Chapter 6: OpenCL 2.0

6.6.3 Usage

bool is_valid_reserve_id(reserve_id_t reserve_id)
    Return true if reserve_id is a valid reservation ID, and false otherwise.

Read and write pipe functions using reservation indices:

int read_pipe(pipe gentype p, reserve_id_t reserve_id, uint index, gentype *ptr)
    Read a packet from the reserved area of the pipe referred to by reserve_id and index into ptr. The reserved pipe entries are referred to by indices that go from 0 to num_packets - 1. Returns 0 if read_pipe is successful, and a negative value otherwise.
context to a given pixel format specification.
d. Use SetPixelFormat to set the pixel format of the specified device context to the format specified.
e. Use wglCreateContext to create a new OpenGL rendering context from the device context (HDC).
f. Use wglMakeCurrent to bind the GL context created in the above step as the current rendering context.
g. Use clGetGLContextInfoKHR (see Section 9.7 of the OpenCL Specification 1.1) and the CL_CURRENT_DEVICE_FOR_GL_CONTEXT_KHR parameter to get the number of GL-associated devices for CL context creation. If the number of devices is zero, go to the next display in the loop. Otherwise, use clGetGLContextInfoKHR (see Section 9.7 of the OpenCL Specification 1.1) and the CL_CURRENT_DEVICE_FOR_GL_CONTEXT_KHR parameter to get the device ID of the CL device associated with the OpenGL context.
h. Use clCreateContext (see Section 4.3 of the OpenCL Specification 1.1) to create the CL context (of type cl_context).

The following code demonstrates how to use the WIN32 Windowing API in CL-GL interoperability on a multi-GPU environment.

E.1 Under Windows E-5
Copyright 2013 Advanced Micro Devices, Inc. All rights reserved.

    int xCoordinate = 0;
    int yCoordinate = 0;

    for (deviceNum = 0; EnumDisplayDevices(NULL, deviceNum,
         &dispDevice, 0); deviceNum++)
    {
        if (dispDevice.StateFlags & DISPLAY_DEVICE_MIRRORING_DRIVER)
166. o GPU command processor 9 DMA engine 9 management memory ssec nue NGANGA 7 work item to fetch unit 9 U unified shader architecture 2 un normalized addresses 10 V variable output counts NDRange 1 variadic arguments use of in the built in printf 12 vector data types parallelism 2 vector instructions 7 vendor platform vendor string 3 vendor name matching platform vendor string 3 vendor specific extensions AN BA 4 Very Long Instruction Word VLIW instruction 4 work item 4 Visual Studio 2008 Professional Edition 4 compiling OpenCL on Windows 2 developing application code 4 VRAM global memory 9 Index 14 W wavefront block of work items 4 combining paths 4 concept relating to compute kernels 2 definition 4 8 mask ccr usd ku Ae Wap UR 5 masking 4 pipelining work items on a stream core 4 relationship to work group 4 relationship with work groups 4 required number spawned by GPU 4 SIZE a AN io cu oh a NG 4 size for optimum hardware usage 4 size on AMD GPUs 4 total execution time
code is not yet possible; however, the C/C++ call stack can be seen in the Visual Studio call stack view, which shows what led to the API function call.

To start kernel debugging, there are several options. One is to Step Into (F11) the appropriate clEnqueueNDRangeKernel function call. Once the kernel starts

AMD Accelerated Parallel Processing OpenCL Programming Guide 4-1
Copyright 2013 Advanced Micro Devices, Inc. All rights reserved.

AMD ACCELERATED PARALLEL PROCESSING

executing, debug it like C/C++ code: stepping into, out of, or over function calls in the kernel; setting source breakpoints; and inspecting the locals, autos, watch, and call stack views.

If you develop on Linux, or do not use Visual Studio, using the CodeXL stand-alone application is very similar. After installing CodeXL, launch the CodeXL stand-alone application from the installation folder. On the start page, select "Create New Project," and use the browse button next to the Executable Path field to select your application. Click the Go (F5) button, and follow the instructions above to enter kernel debugging.

4.2 Debugging CPU Kernels with GDB
This section describes an experimental feature for using the GNU project debugger, GDB, to debug kernels on x86 CPUs running Linux or cygwin/minGW under Windows.

4.2.1 Setting the Environment
The OpenCL program to be debugged first is compiled by passing the -g -O0 (or -g -cl-opt-disable)
oject Properties -> C/C++ -> Additional Include Directories: these must include $(AMDAPPSDKROOT)/include for OpenCL headers. Optionally, they can include $(AMDAPPSDKSAMPLESROOT)/include for SDKUtil headers.
- Project Properties -> C/C++ -> Preprocessor Definitions: these must define AMD_OS_WIN.

Chapter 3: Building and Running OpenCL Programs
Copyright 2013 Advanced Micro Devices, Inc. All rights reserved.

AMD ACCELERATED PARALLEL PROCESSING

- Project Properties -> Linker -> Additional Library Directories: these must include $(AMDAPPSDKROOT)/lib/x86 for OpenCL libraries. Optionally, they can include $(AMDAPPSDKSAMPLESROOT)/lib/x86 for SDKUtil libraries.
- Project Properties -> Linker -> Input -> Additional Dependencies: these must include OpenCL.lib. Optionally, they can include SDKUtil.lib.

3.1.2 Compiling on Linux
To compile OpenCL applications on Linux requires that the gcc or the Intel C compiler is installed. There are two major steps to do this: compiling and linking.

1. Compile all the C++ files (Template.cpp), and get the object files. For 32-bit object files on a 32-bit system, or 64-bit object files on a 64-bit system:

    g++ -o Template.o -DAMD_OS_LINUX -c Template.cpp -I$AMDAPPSDKROOT/include

For building 32-bit object files on a 64-bit system:

    g++ -o Template.o -DAMD_OS_LINUX -c Template.cpp -I$AMDAPPSDKROOT/include

2. Link all the object files generated in the previous step to the OpenCL library
for each vector component:

    dst.i = (((((src0.i >>  0) & 0xFF) + ((src1.i >>  0) & 0xFF) + ((src2.i >>  0) & 1)) >> 1) <<  0) +
            (((((src0.i >>  8) & 0xFF) + ((src1.i >>  8) & 0xFF) + ((src2.i >>  8) & 1)) >> 1) <<  8) +
            (((((src0.i >> 16) & 0xFF) + ((src1.i >> 16) & 0xFF) + ((src2.i >> 16) & 1)) >> 1) << 16) +
            (((((src0.i >> 24) & 0xFF) + ((src1.i >> 24) & 0xFF) + ((src2.i >> 24) & 1)) >> 1) << 24);

Built-in function: amd_sad
    uintn amd_sad(uintn src0, uintn src1, uintn src2)
Return value for each vector component:

    dst.i = src2.i +
            abs(((src0.i >>  0) & 0xFF) - ((src1.i >>  0) & 0xFF)) +
            abs(((src0.i >>  8) & 0xFF) - ((src1.i >>  8) & 0xFF)) +
            abs(((src0.i >> 16) & 0xFF) - ((src1.i >> 16) & 0xFF)) +
            abs(((src0.i >> 24) & 0xFF) - ((src1.i >> 24) & 0xFF));

Built-in function: amd_sad4
    uint amd_sad4(uint4 a, uint4 b, uint c)
Return value for each vector component:

    dst.i = src2.i +
            abs(((src0.i >>  0) & 0xFF) - ((src1.i >>  0) & 0xFF)) +
            abs(((src0.i >>  8) & 0xFF) - ((src1.i >>  8) & 0xFF)) +
            abs(((src0.i >> 16) & 0xFF) - ((src1.i >> 16) & 0xFF)) +
            abs(((src0.i >> 24) & 0xFF) - ((src1.i >> 24) & 0xFF));

Appendix A: OpenCL Optional Extensions
Copyright 2013 Advanced Micro Devices, Inc. All rights reserved.
OpenCL runtime interface (Figure 3.1). The front-end translates the OpenCL source to LLVM IR. It keeps OpenCL-specific information as metadata structures. (For example, to debug kernels, the front-end creates metadata structures to hold the debug information; also, a pass is inserted to translate this into LLVM debug nodes, which include the line numbers and source code mapping.) The front-end supports additional data-types (int4, float8, etc.), additional keywords (kernel, global, etc.), and built-in functions (get_global_id(), barrier(), etc.). Also, it performs additional syntactic and semantic checks to ensure the kernels meet the OpenCL specification.

The input to the LLVM linker is the output of the front-end and the library of built-in functions. This links in the built-in OpenCL functions required by the source and transfers the data to the optimizer, which outputs optimized LLVM IR.

For GPU processing, the LLVM IR-to-AMD IL module receives LLVM IR and generates optimized IL for a specific GPU type in an incomplete format, which is passed to the OpenCL runtime, along with some metadata for the runtime layer to finish processing. For CPU processing, LLVM AS generates x86 binary.

3.1.1 Compiling on Windows
To compile OpenCL applications on Windows requires that Visual Studio 2008 Professional Edition (or later) or the Intel C/C++ compiler are installed. All C++ files must be added to the project, which must have the following settings.

- Pr
171. ple Programs 1 17 Copyright 2013 Advanced Micro Devices Inc All rights reserved 1 18 break AMD ACCELERATED PARALLEL PROCESSING LOI T IT TT I TTL LITT TT I TT TT LT TT p PL E IP IP LP Pg g PL P Bg M M PP Pg Mg MOML P gg M PC M g A fg initHost VMIMMMMMMIEEEEIlMMIIIIIPMPMCLGM MUEWMWEPLII I I EEE Find the platform VMIMMMMMMIEEEPPHLMIPIIPPMCP MP PM TT LT I TT PM M P PP PP P M P P P P P P P P gg eM LL B 4g cl Platform get amp platforms std vector lt cl Platform gt iterator iter for iter platforms begin iter platforms end iter if strcomp iter getInfo CL PLATFORM VENDOR c str Advanced Micro Devices Inc VMMUMMMMIMMMMMPMMIHLIPB BP ULL IL P ML gg PP B M MM P P MM P P M HL P Bg MI ga Create an OpenCL context SISI P M PPM IE IE EI cl context properties cps 3 CL CONTEXT PLATFORM cl context properties iter 0 0 context cl Context CL DEVICE TYPE GPU cps VMIIMMMMMMIMIMMIMMMPMPMPMPMPPMPMMPMMMPMMMIIIIIIMIL PB Maa E HE HP PG Mg lll b gl Detect OpenCL devices eec AA APA AAP devices context getInfo lt CL CONTEXT DEVICES Ee AA EDL GEN Create an OpenCL command queu HIPH HH HH M queue cl CommandQueue context devices 0 VMUIMMMMMMMMPM MC M KAAU TT LI TT TT TT BABABA P M MM PP P NAKABABA KA GM Ig Bg Create OpenCL memory buffers LOT T IT TT I TT I I TT TT TT D g MI KAKA BABABA BAKA MP MP Mab lE P P P lll ll buf
ported since Stream SDK 2.2.

AMD ACCELERATED PARALLEL PROCESSING

C.1.1 Executable and Linkable Format (ELF) Header
For the ELF binary to be considered valid, the AMD OpenCL runtime expects certain values to be specified. The following header fields must be set for all binaries that are created outside of the OpenCL framework.

Table C.1 ELF Header Fields

    Field                    Value                      Description
    e_ident[EI_CLASS]        ELFCLASS32 or ELFCLASS64   BIF can be either 32-bit ELF or 64-bit ELF.
    e_ident[EI_DATA]         ELFDATA2LSB                BIF is stored in little-endian order.
    e_ident[EI_OSABI]        ELFOSABI_NONE              Not used.
    e_ident[EI_ABIVERSION]   0                          Not used.
    e_type                   ET_NONE                    Not used.
    e_machine                oclElfTargets enum         CPU/GPU machine ID.
    e_version                EV_CURRENT                 Must be EV_CURRENT.
    e_entry                  0                          Not used.
    e_phoff                  0                          Not used.
    e_flags                  0                          Not used.
    e_phentsize              0                          Not used.
    e_phnum                  0                          Not used.

The fields not shown in Table C.1 are given values according to the ELF specification. The e_machine value is defined as one of the oclElfTargets enumerants; the values for these are:

C.1.2 Bitness

    GPU: e_machine = 1001 + CALtargetEnum
    CPU: e_machi
options during the runtime:

    AMD_OCL_BUILD_OPTIONS         Overrides the CL options specified in clBuildProgram.
    AMD_OCL_BUILD_OPTIONS_APPEND  Appends options to those specified in clBuildProgram.

3.2 Running the Program
The runtime system assigns the work in the command queues to the underlying devices. Commands are placed into the queue using the clEnqueue commands shown in the listing below.

    OpenCL API Function            Description
    clCreateCommandQueue           Create a command queue for a specific device (CPU, GPU).
    clCreateProgramWithSource,     Create a program object using the source code of the
    clCreateProgramWithBinary      application kernels.
    clBuildProgram                 Compile and link to create a program executable from the
                                   program source or binary.
    clCreateKernel                 Creates a kernel object from the program object.
    clCreateBuffer                 Creates a buffer object for use via OpenCL kernels.
    clSetKernelArg,                Set the kernel arguments, and enqueue the kernel in a
    clEnqueueNDRangeKernel         command queue.
    clEnqueueReadBuffer,           Enqueue a command in a command queue to read from a
    clEnqueueWriteBuffer           buffer object to host memory, or write to the buffer object
                                   from host memory.
    clEnqueueWaitForEvents         Wait for the specified events to complete.

3.2 Running the Program 3-5
Copyright 2013 Advanced Micro Devices, Inc. All rights reserved.

AMD ACCELERATED PARALLEL PROCESSING

The commands can be broadly classified into three categories:
for non-graphics fields, such as physics, AI, modeling, HPC, and various other computationally intensive applications.

In a compute kernel, the work-item spawn order is sequential. This means that on a chip with N work-items per wavefront, the first N work-items go to wavefront 1, the second N work-items go to wavefront 2, etc. Thus, the work-item IDs for wavefront K are in the range (K * N) to ((K + 1) * N) - 1.

Term: wavefronts and work-groups
Description: Wavefronts and work-groups are two concepts relating to compute kernels that provide data-parallel granularity. A wavefront executes a number of work-items in lock step relative to each other. Sixteen work-items are executed in parallel across the vector unit, and the whole wavefront is covered over four clock cycles. It is the lowest level that flow control can affect. This means that if two work-items inside of a wavefront go divergent paths of flow control, all work-items in the wavefront go to both paths of flow control.

Grouping is a higher-level granularity of data parallelism that is enforced in software, not hardware. Synchronization points in a kernel guarantee that all work-items in a work-group reach that point (barrier) in the code before the next statement is executed.

Work-groups are composed of wavefronts. B
Parallel Processing.
- cl_khr_d3d10_sharing: allows association of a D3D10 context or share group with a CL context for interoperability.

A.7 cl_ext Extensions
- cl_ext_device_fission: support for device fission in OpenCL. For more information about this extension, see:
  http://www.khronos.org/registry/cl/extensions/ext/cl_ext_device_fission.txt

A.8 AMD Vendor-Specific Extensions
This section describes the AMD vendor-specific extensions.

A.8.1 cl_amd_fp64
Before using double data types, double-precision floating point operators, and/or double-precision floating point routines in OpenCL C kernels, include the #pragma OPENCL EXTENSION cl_amd_fp64 : enable directive. See Table A.1 for a list of supported routines.

A.8.2 cl_amd_vec3
This extension adds support for vectors with three elements: float3, short3, char3, etc. This data type was added to OpenCL 1.1 as a core feature. For more details, see section 6.1.2 in the OpenCL 1.1 or OpenCL 1.2 spec.

A.8.3 cl_amd_device_persistent_memory
This extension adds support for the new buffer and image creation flag CL_MEM_USE_PERSISTENT_MEM_AMD. Buffers and images allocated with this flag reside in host-visible device memory. This flag is mutually exclusive with the flags CL_MEM_ALLOC_HOST_PTR and CL_MEM_USE_HOST_PTR.
176. rectives to enable or disable extensions 2 distributing the kernel 1 enqueued commands 5 ensuring the kernels meet specification 2 extensions enabling or disabling 2 following same pattern 4 functions 2 general compilation path of applications 1 generating FONG 25222 rere ep a a 3 llvumir 3 CAP exe REV bu Pd 3 a binal 345824 doutes paires 1 GPU processing 1 host program 2 ICD code requirements 1 implementations use clGetPlatformlds function 1 use clGetPlatforminfo function L Installable Client Driver ICD 1 introductory sample C bindings 14 kernel compiling 1 kernel symbols not visible in debugger 2 list of available implementations 1 of commands 5 mapping 3 memory domains 5 metadata structures 2 minimalist C program sample 10 optional extensions q kernel program 2 performance libraries components 1 profiling components 1 printf capabilities 12 programmers introductory sample 14 programming model 3 allocating memory
different kernel types interleaved on the same hardware. Programmable GPU compute devices execute various user-developed programs, known to graphics programmers as shaders and to compute programmers as kernels. These GPU compute devices can execute non-graphics functions using a data-parallel programming model that maps executions onto compute units. Each compute unit contains one (pre-GCN devices) or more (GCN devices) vector (SIMD) units. In this programming model, known as AMD Accelerated Parallel Processing, arrays of input data elements stored in memory are accessed by a number of compute units.

Each instance of a kernel running on a compute unit is called a work-item. Work-items are mapped to an n-dimensional index space, called an NDRange. The GPU schedules the range of work-items onto a group of processing elements, until all work-items have been processed. Subsequent kernels then can be executed, until the application completes. A simplified view of the AMD Accelerated Parallel Processing programming model and the mapping of work-items to processing elements is shown in Figure 2.2.

[Figure 2.2: an ND-Range of work-items mapped onto the processing elements of Compute Unit 0. Caption: Simplified Mapping of OpenCL onto AMD Accelerated Parallel P
178. ributes swa swa colormap XCreateColormap displayName RootWindow displayName vi gt screen vi gt visual AllocNone 0 swa border pixel StructureNotifyMask swa event mask E 2 Linux Operating System E 9 Copyright 2013 Advanced Micro Devices Inc All rights reserved AMD ACCELERATED PARALLEL PROCESSING Window win XCreateWindow displayName RootWindow displayName vi gt screen 10 10 WINDOW WIDTH WINDOW HEIGHT 0 vi gt depth InputOutput vi gt visual CWBorderPixel CWColormap CWEventMask aswa XMapWindow displayName win std cout lt lt glXCreateContextAttribsARB lt lt void glXGetProcAddress const GLubyte glXCreateContextAttribsARB lt lt std endl GLXCREATECONTEXTATTRIBSARBPROC glXCreateContextAttribsARB GLXCREATECONTEXTATTRIBSARBPROC glXGetProcAddress const GLubyte glXCreateContextAttribsARB int attribs GLX CONTEXT MAJOR VERSION ARB 3 GLX CONTEXT MINOR VERSION ARB 0 0 E na GLXContext ctx glXCreateContextAttribsARB displayName fbc 0 true attribs glXMakeCurrent displayName win ctx l context properties cpsGL CONTEXT PLATFORM cl context properties platform GLX DISPLAY KHR intptr t glXGetCurrentDisplay GL CONTEXT KHR intptr t gGlCtx 0 nana status clGetGLConte
179. rmap XCreateColormap displayName RootWindow displayName vi gt screen vi gt visual AllocNone swa border pixel 0 swa event_mask StructureNotifyMask win XCreateWindow displayName RootWindow displayName vi gt screen 10 10 width height 0 vi gt depth InputOutput vi gt visual CWBorderPixel CWColormap CWEventMask amp Swa XMapWindow displayName win Appendix E OpenCL OpenGL Interoperability Copyright 2013 Advanced Micro Devices Inc All rights reserved AMD ACCELERATED PARALLEL PROCESSING int attribs GLX CONTEXT MAJOR VERSION ARB 3 GLX CONTEXT MINOR VERSION ARB 0 0 d GLXContext ctx glXCreateContextAttribsARB displayName fbc 0 true attribs glXMakeCurrent displayName win Gb gGlCtx glXGetCurrentContext properties cpsGL CL CONTEXT PLATFORM cl context properties platform CL GLX DISPLAY KHR intptr t glXGetCurrentDisplay CL GL CONTEXT KHR intptr t gGlCtx 0 size t deviceSize 0 status clGetGLContextInfoKHR CpsGL CL CURRENT DEVICE FOR GL CONTEXT KHR O NULL amp deviceSize int numDevices deviceSize sizeof cl device id if numDevices 0 glXDestroyContext glXGetCurrentDisplay gGlCtx continue else Interoperable device found std cout lt lt Interoperable device
        return 0;
      return 1;
    }

    /*
     * set_primes_kernel:
     * fills the boolean primes array to reflect whether each entry in
     * the input array is prime.
     */
    kernel void set_primes_kernel(global int *in, global int *primes)
    {
      int id = get_global_id(0);
      primes[id] = 0;
      if (isPrime(in[id]))
        primes[id] = 1;
    }

    /*
     * get_primes_kernel:
     * fills the output array to reflect whether each entry in the
     * input array is prime.
     */
    kernel void get_primes_kernel(global int *in, global int *output,
                                  global int *outPrimes)
    {
      int id = get_global_id(0);
      int k = output[id] - 1;
      if (id == 0) {
        if (output[id] == 1)
          outPrimes[k] = in[id];
        return;
      }
      if (output[id] != output[id - 1])
        outPrimes[k] = in[id];
    }

    /*
     * group_scan_kernel:
     * takes an input array in and produces a work-group-level scan
     * in out.
     */
    kernel void group_scan_kernel(global int *in, global int *out)
    {
      int in_data;
      int i = get_global_id(0);
      in_data = in[i];
      out[i] = work_group_scan_inclusive_add(in_data);
    }

6.4 Device-side enqueue

    /*
     * global_scan_kernel:
     * takes a work-group-scanned array out from group_scan_kernel and
     * gives the globally scanned array
Chapter 2: AMD Implementation

Work-groups are assigned to CUs. All work-items of a work-group can be processed only by the processing elements of a single CU. A processing element can process only one work-item at a time; however, a CU can process multiple work-groups.

OpenCL maps the total number of work-items to be launched onto an n-dimensional grid (ND-Range). The developer can specify how to divide these items into work-groups. AMD GPUs execute on wavefronts (groups of work-items executed in lock-step in a compute unit); there is an integer number of wavefronts in each work-group. Thus, as shown in Figure 2.3, hardware that schedules work-items for execution in the AMD Accelerated Parallel Processing environment includes the intermediate step of specifying wavefronts within a work-group. This permits achieving maximum performance on AMD GPUs. For a more detailed discussion of wavefronts, see Section 1.1, "Terminology."

[Figure 2.3: Work-Item Grouping Into Work-Groups and Wavefronts. The diagram shows an ND-Range (Dimension X, Dimension Y) divided into work-groups, each composed of wavefronts of work-items; the wavefront size is hardware-specific.]

2.1 The AMD Accelerated Parallel Processing Implementation of OpenCL
3.2.2 Running Code on Linux

The following steps ensure the execution of OpenCL applications on Linux:

1. The path to libOpenCL.so ($AMDAPPSDKROOT/lib/x86) must be included in $LD_LIBRARY_PATH.
2. /usr/lib/OpenCL/vendors/ must have libatiocl32.so and/or libatiocl64.so.
3. Generally, the path to the kernel file (Template_Kernel.cl) specified in the host program is relative to the executable. Unless an absolute path is specified, the kernel file must be in the same directory as the executable.

3.3 Calling Conventions

For all Windows platforms, the _stdcall calling convention is used. Function names are undecorated. For Linux, the calling convention is cdecl.

Chapter 4: Debugging OpenCL

This chapter discusses how to debug OpenCL programs running on AMD Accelerated Parallel Processing GPU and CPU compute devices. The preferred method is to debug with AMD CodeXL, as described in Section 4.1, "AMD CodeXL GPU Debugger." The second method, described in Section 4.2, "Debugging CPU Kernels with GDB," is to use exper
    gHwnd = CreateWindow(reinterpret_cast<LPCSTR>("SimpleGL"),
                         reinterpret_cast<LPCSTR>("SimpleGL"),
                         WS_CAPTION | WS_POPUPWINDOW | WS_VISIBLE,
                         0, 0, screenWidth, screenHeight,
                         NULL, NULL, windowclass.hInstance, NULL);
    hDC  = GetDC(gHwnd);
    pfmt = ChoosePixelFormat(hDC, &pfd);
    ret  = SetPixelFormat(hDC, pfmt, &pfd);
    hRC  = wglCreateContext(hDC);
    ret  = wglMakeCurrent(hDC, hRC);

    cl_context_properties properties[] = {
        CL_CONTEXT_PLATFORM, (cl_context_properties) platform,
        CL_GL_CONTEXT_KHR,   (cl_context_properties) hRC,
        CL_WGL_HDC_KHR,      (cl_context_properties) hDC,
        0
    };

    status = clGetGLContextInfoKHR(properties,
                                   CL_CURRENT_DEVICE_FOR_GL_CONTEXT_KHR,
                                   sizeof(cl_device_id), &interopDevice,
                                   NULL);

    // Create OpenCL context from device's id
    context = clCreateContext(properties, 1, &interopDevice, 0, 0, &status);

5.1.2 Multi-GPU Environment

5.1.2.1 Creating a CL Context from a GL Context

Do not use the GLUT windowing system in a multi-GPU environment, because it always creates a GL context on the primary display, and it is not possible to specify which display device to select for a GL context.

To use the Win32 API for windowing in a multi-GPU environment:

1. Detect each display by using the EnumDisplayDevices function. This function lets you obtain information about di
converts sRGB values into linear RGB. Converting sRGB into RGB in the kernel explicitly is not necessary if the device supports OpenCL 2.0. Note that only read_imagef can be used for reading sRGB image data, because only the CL_UNORM_INT8 data type is supported with OpenCL 2.0.

The following kernel sample illustrates how to read an sRGB image object.

    // Read an sRGBA image object (input) and convert it to
    // linear RGB values (results).
    kernel void sample_kernel(read_only image2d_t input,  // sRGBA image object
                              sampler_t imageSampler,
                              global float *xOffsets,
                              global float *yOffsets,
                              global float4 *results)
    {
        int tidX = get_global_id(0);
        int tidY = get_global_id(1);
        int offset = tidY * get_image_width(input) + tidX;
        int2 coords = (int2)(xOffsets[offset], yOffsets[offset]);
        results[offset] = read_imagef(input, imageSampler, coords);
    }

OpenCL 2.0 does not include writing sRGB images directly, but provides the cl_khr_srgb_image_writes extension. The AMD implementations do not support this extension as of this writing. In order to write sRGB pixels in a kernel, explicit conversion from linear RGB to sRGB must be implemented in the kernel.

clFillImage is an exception for writing an sRGB image directly: the AMD OpenCL platform supports clFillImage for filling an sRGB image from linear RGB data directly.
Extensions Approved by the Khronos Group

Supported extensions approved by the Khronos Group are:

- cl_khr_global_int32_base_atomics: basic atomic operations on 32-bit integers in global memory.
- cl_khr_global_int32_extended_atomics: extended atomic operations on 32-bit integers in global memory.
- cl_khr_local_int32_base_atomics: basic atomic operations on 32-bit integers in local memory.
- cl_khr_local_int32_extended_atomics: extended atomic operations on 32-bit integers in local memory.
- cl_khr_int64_base_atomics: basic atomic operations on 64-bit integers in both global and local memory.
- cl_khr_int64_extended_atomics: extended atomic operations on 64-bit integers in both global and local memory.

A.5 Getting Extension Function Pointers

- cl_khr_3d_image_writes: supports kernel writes to 3D images.
- cl_khr_byte_addressable_store: eliminates the restriction of not allowing writes to a pointer (or array elements) of types less than 32 bits wide in a kernel program.
- cl_khr_gl_sharing: allows association of an OpenGL context or share group with a CL context for interoperability.
- cl_khr_icd: the OpenCL Installable Client Driver (ICD), which lets developers select from multiple OpenCL runtimes that may be installed on a system. This extension is automatically enabled as of SDK v2 for AMD Accelerated Pa
On the host, the tree might have had to be transformed into arrays for GPU compute. The values provided in parentheses indicate the extra time required for such computation. Alternatively, offsets might have been required instead of pointers on both the host and the device, at the cost of a few more additions. Finally, more than 5M nodes could not be allocated in the OpenCL 1.2 version, because the allowable memory allocation was limited by the amount of memory that could be used on the device. Overall, the 2.0 version exceeds the 1.2 version in both performance and usability.

6.3 Generic Address Space

6.3.1 Overview

In OpenCL 1.2, all pointer parameters in a function definition must have address spaces associated with them. (The default address space is the private address space.) This necessitates creating an explicit version of the function for each desired address space.

OpenCL 2.0 introduces a new address space called the generic address space. Data cannot be stored in the generic address space, but a pointer to this space can reference data located in the private, local, or global address spaces. A function with generic pointer arguments may be called with pointers to any address space except the constant address space. Pointers that are declared without pointing to a named address space point to the generic address space.

6.3.2 Usage
these kernel names leads to the correct overloaded kernel.

5.4.3 Kernel Template

This example defines a kernel template, testAdd. It also defines two explicit instants of the kernel template, testAddFloat4 and testAddInt8. The names testAddFloat4 and testAddInt8 are the external names for the two kernel template instants that must be used as parameters when calling the clCreateKernel API.

    template <class T>
    kernel void testAdd(global T *src1, global T *src2, global T *dst)
    {
        int tid = get_global_id(0);
        dst[tid] = src1[tid] + src2[tid];
    }

    template __attribute__((mangled_name(testAddFloat4)))
    kernel void testAdd(global float4 *src1, global float4 *src2,
                        global float4 *dst);

    template __attribute__((mangled_name(testAddInt8)))
    kernel void testAdd(global int8 *src1, global int8 *src2,
                        global int8 *dst);

Chapter 6: OpenCL 2.0

6.1 Introduction

The OpenCL 2.0 specification is a significant evolution of OpenCL. It introduces features that allow closer collaboration between the host and OpenCL devices, such as Shared Virtual Memory (SVM) and device-side enqueue. Other features, such as pipes, dynamic parallelism, and new image
processing elements (PEs). Each PE contains one ALU. Each SIMD unit simultaneously executes a single operation across 16 work-items, but each can be working on a separate wavefront.

[Figure: GCN compute device block diagram. Two asynchronous compute engines with command processors feed an array of compute units; each compute unit contains one scalar unit, four 16-lane vector (SIMD) units, an L1 cache, and a local data share (LDS).]
version for the same problem. Unlike OpenCL 2.0, OpenCL 1.2 does not provide work-group built-ins to perform a prefix sum efficiently. While the primes can be found in the same manner as in OpenCL 2.0, the group and global scans contain redundant computations, on the order of O(n) for each work-item, as illustrated in the following basic OpenCL 1.2 implementation:

    kernel void get_primes_kernel(global int *in, global int *primes,
                                  global int *outPrimes)
    {
        int id = get_global_id(0);
        int i, idx = 0;
        if (primes[id]) {
            for (i = 0; i < id; i++)
                idx += primes[i];
            outPrimes[idx] = in[id];
        }
    }

Thus, in OpenCL 1.2, each work-item computes the prefix sum for its index and stores the input prime into the outPrimes array. Clearly, more work is performed for each work-item compared to OpenCL 2.0. The following graph and table compare the performance of the OpenCL 2.0 version with that of the OpenCL 1.2 version.

    Size        OpenCL 1.2 (secs)    OpenCL 2.0 (secs)
    1024        0.000365             0.000495
    2048        0.000588             0.0006
    4096        0.00115              0.0005
    10000       0.003                0.0011
    20000       0.00692              0.0012
    40000       0.02222              0.00165
    80000       0.07897              0.00254
    160000      0.29292              0.00391
    320000      1.1427               0.00614
    640000      4.5203               0.010622
    1280000     17.991               0.01928

[Figure: "Extract Primes, 1.2 vs. 2.0" - execution time plotted against input size for the two versions.]
source language is considered to be OpenCL C, as defined in the following sections of this document.

5.3.2 Classes and Derived Classes

OpenCL C++ is extended to support classes and derived classes, as per Sections 9 and 10 of the C++ language specification, with the limitation that virtual functions and abstract classes are not supported. The virtual keyword is reserved, and the OpenCL C++ compiler is required to report a compile-time error if it is used in the input program. This limitation restricts class definitions to be fully statically defined. There is nothing prohibiting a future version of OpenCL C++ from relaxing this restriction, pending performance implications.

A class definition cannot contain any address space qualifier, either for members or for methods:

    class myClass {
    public:
        int myMethod1() { return x; }
        void __local myMethod2() { x = 0; }
    private:
        int x;
        __local int y;    // illegal
    };

The class invocation inside a kernel, however, can be in either the private or the local address space:

    __kernel void myKernel()
    {
        myClass c1;
        __local myClass c2;
        ...
    }

Classes can be passed as arguments to kernels by defining a buffer object at the size of the class.

1. The OpenCL Programming Language 1.2, Rev 15, Khronos, 2011.

5.3 Additions and Changes to Section 6: The OpenCL C Programming Language
display devices in the current session.

2. To query all display devices in the current session, call this function in a loop, starting with DevNum set to 0 and incrementing DevNum until the function fails. To select all display devices in the desktop, use only the display devices that have the DISPLAY_DEVICE_ATTACHED_TO_DESKTOP flag set in the DISPLAY_DEVICE structure.

3. To get information on the display adapter, call EnumDisplayDevices with lpDevice set to NULL. For example, DISPLAY_DEVICE.DeviceString contains the adapter name.

4. Use EnumDisplaySettings to get DEVMODE; dmPosition.x and dmPosition.y give the x and y coordinates of the current display.

5. Try to find the first OpenCL device (winner) associated with the OpenGL rendering context by using the loop technique of step 2, above. Inside the loop:

   a. Create a window on a specific display by using the CreateWindow function. This function returns the window handle (HWND).
   b. Use GetDC to get a handle to the device context for the client area of a specific window, or for the entire screen; or use the CreateDC function to create a device context (HDC) for the specified device.
   c. Use ChoosePixelFormat to match an appropriate pixel format supported by a device co
until the access unit finishes accessing memory. Meanwhile, other work-items can be active within the compute unit, contributing to better performance. The data fetch units handle three basic types of memory operations: loads, stores, and streaming stores. GPU compute devices can store writes to random memory locations using global buffers.

1.5.4 Global Memory

The global memory lets applications read from, and write to, arbitrary locations in memory. When using global memory, such read and write operations from the stream kernel are done using regular GPU compute device instructions, with the global memory used as the source or destination for the instruction. The programming interface is similar to load/store operations used with CPU programs, where the relative address in the read/write buffer is specified.

When using global memory, each work-item can write to an arbitrary location within it. Global memory uses a linear layout. If consecutive addresses are written, the compute unit issues a burst write for more efficient memory access. Only read-only buffers, such as constants, are cached.

1.5.5 Image Read/Write

Image reads are done by addressing the desired location in the input memory using the fetch unit. The fetch units can process either 1D or 2D addresses. These addresses can be normalized or un-norm
cl_kernel_sub_group_info: see clGetKernelSubGroupInfoKHR.

F.3.2 New Macros

    CL_INVALID_PIPE_SIZE
    CL_INVALID_DEVICE_QUEUE
    CL_VERSION_2_0
    CL_DEVICE_QUEUE_ON_HOST_PROPERTIES
    CL_DEVICE_MAX_READ_WRITE_IMAGE_ARGS
    CL_DEVICE_MAX_GLOBAL_VARIABLE_SIZE
    CL_DEVICE_QUEUE_ON_DEVICE_PROPERTIES
    CL_DEVICE_QUEUE_ON_DEVICE_PREFERRED_SIZE
    CL_DEVICE_QUEUE_ON_DEVICE_MAX_SIZE
    CL_DEVICE_MAX_ON_DEVICE_QUEUES
    CL_DEVICE_MAX_ON_DEVICE_EVENTS
    CL_DEVICE_SVM_CAPABILITIES
    CL_DEVICE_GLOBAL_VARIABLE_PREFERRED_TOTAL_SIZE
    CL_DEVICE_MAX_PIPE_ARGS
    CL_DEVICE_PIPE_MAX_ACTIVE_RESERVATIONS
    CL_DEVICE_PIPE_MAX_PACKET_SIZE
    CL_DEVICE_PREFERRED_PLATFORM_ATOMIC_ALIGNMENT
    CL_DEVICE_PREFERRED_GLOBAL_ATOMIC_ALIGNMENT
    CL_DEVICE_PREFERRED_LOCAL_ATOMIC_ALIGNMENT
    CL_QUEUE_ON_DEVICE
    CL_QUEUE_ON_DEVICE_DEFAULT
    CL_DEVICE_SVM_COARSE_GRAIN_BUFFER
    CL_DEVICE_SVM_FINE_GRAIN_BUFFER
    CL_DEVICE_SVM_FINE_GRAIN_SYSTEM
    CL_DEVICE_SVM_ATOMICS
    CL_QUEUE_SIZE
    CL_MEM_SVM_FINE_GRAIN_BUFFER
    CL_MEM_SVM_ATOMICS
    CL_sRGB
    CL_sRGBx
    CL_sRGBA
    CL_sBGRA
    CL_ABGR
    CL_MEM_OBJECT_PIPE
    CL_MEM_USES_SVM_POINTER
    CL_PIPE_PACKET_SIZE
    CL_PIPE_MAX_PACKETS
    CL_SAMPLER_MIP_FILTER_MODE
    CL_SAMPLER_LOD_MIN
    CL_SAMPLER_LOD_MAX
    CL_PROGRAM_BUILD_GLOBAL_VARIABLE_TOTAL_SIZE
    CL_KERNEL_ARG_TYPE_PIPE
    CL_KERNEL_EX
Built-in function: uintn amd_sadhi(uintn src0, uintn src1, uintn src2)

Return value for each vector component:

    src2[i] +
      (abs(((src0[i] >>  0) & 0xFF) - ((src1[i] >>  0) & 0xFF)) << 16) +
      (abs(((src0[i] >>  8) & 0xFF) - ((src1[i] >>  8) & 0xFF)) << 16) +
      (abs(((src0[i] >> 16) & 0xFF) - ((src1[i] >> 16) & 0xFF)) << 16) +
      (abs(((src0[i] >> 24) & 0xFF) - ((src1[i] >> 24) & 0xFF)) << 16)

For more information, see http://www.khronos.org/registry/cl/extensions/amd/cl_amd_media_ops.txt

A.8.10 cl_amd_media_ops2

This extension adds further built-in functions to those of cl_amd_media_ops. When enabled, it adds the following built-in functions to the OpenCL language. Note: typen denotes an open scalar type (n = 1) and vector types (n = 2, 4, 8, 16).

Built-in function: uintn amd_msad(uintn src0, uintn src1, uintn src2)

Description:

    uchar4 src0u8 = as_uchar4(src0.s0);
    uchar4 src1u8 = as_uchar4(src1.s0);
    dst.s0 = src2.s0 +
             ((src1u8.s0 == 0) ? 0 : abs(src0u8.s0 - src1u8.s0)) +
             ((src1u8.s1 == 0) ? 0 : abs(src0u8.s1 - src1u8.s1)) +
             ((src1u8.s2 == 0) ? 0 : abs(src0u8.s2 - src1u8.s2)) +
             ((src1u8.s3 == 0) ? 0 : abs(src0u8.s3 - src1u8.s3));

A similar operation is applied to the other components of the vectors.
A mask is set true for the lanes (elements/items) in which x is true; then A is executed. The mask then is inverted, and B is executed.

Example 1: If two branches, A and B, take the same amount of time t to execute over a wavefront, the total time of execution, if any work-item diverges, is 2t.

Loops execute in a similar fashion: the wavefront occupies a compute unit as long as there is at least one work-item in the wavefront still being processed. Thus, the total execution time for the wavefront is determined by the work-item with the longest execution time.

Example 2: If t is the time it takes to execute a single iteration of a loop, and within a wavefront all work-items execute the loop one time, except for a single work-item that executes the loop 100 times, the time it takes to execute that entire wavefront is 100t.

2.2 Hardware Overview for GCN Devices

A general OpenCL device comprises compute units (CUs), each of which has sub-modules that ultimately contain ALUs. A work-item (or SPMD kernel instance) executes on an ALU, as shown in Figure 2.4.

[Figure 2.4: Generalized AMD GPU Compute Device Structure for GCN Devices]

In GCN devices, each CU includes one Scalar Unit and four Vector (SIMD) units, each of which contains an array of 16 proces
    Increment refcount of event
    Decrement refcount of event
    Create user event
    Check if event is valid
    Signal user event
    Schedule capture of profiling info
    Get default queue
    Create 1D NDRange
    Create 2D NDRange
    Create 3D NDRange

F.1.10 Sub-groups

    get_sub_group_size                Get size of current sub-group
    get_max_sub_group_size            Get size of largest sub-group
    get_num_sub_groups                Get number of sub-groups in current work-group
    get_enqueued_num_sub_groups       Get number of sub-groups in uniform work-group
    get_sub_group_id                  Get id of current sub-group
    get_sub_group_local_id            Get id of work-item within its sub-group
    sub_group_barrier                 Sub-group barrier
    sub_group_all                     Test all members of sub-group (and-reduction)
    sub_group_any                     Test any member of sub-group (or-reduction)
    sub_group_broadcast               Broadcast value to every member of sub-group
    sub_group_reduce_add              Sum reduction across sub-group
    sub_group_reduce_max              Max reduction across sub-group
    sub_group_reduce_min              Min reduction across sub-group
    sub_group_scan_exclusive_add      Sum exclusive scan across sub-group
    sub_group_scan_exclusive_max      Max exclusive scan across sub-group
    sub_group_scan_exclusive_min      Min exclusive scan across sub-group
    sub_group_scan_inclusive_add      Sum inclusive scan across sub-group
    sub_group_scan_inclusive_max      Max inclusive scan across sub-group
    sub_group_scan_inclusive_min      Min inclusive scan across sub-group
    sub_group_reserve_read_pipe       Sub-group pipe reservations
    sub_group_reserve_write_pipe
    sub_group_commit_read_pipe
    sub_group_commit_write_pipe
    get_kernel_sub_group_count_for_ndrange
    get_kernel_max_sub_group_size_for_ndrange

F.2 Deprecated built-ins

    barrier
    mem_fence
    read_mem_fence
taken from C11, are: relaxed, acquire, release, acquire-release, and sequentially consistent.

OpenCL 2.0 introduces a new, C11-based set of atomic operations with specific memory-model-based semantics. Atomic operations are indivisible: a thread (or agent) cannot see partial results. The atomic operations supported are:

- atomic_load/store
- atomic_init
- atomic_work_item_fence
- atomic_exchange
- atomic_compare_exchange
- atomic_fetch_<op>, where <op> is add, sub, xor, and, or or

OpenCL 2.0 introduces the concept of memory scope, which limits the extent to which atomic operations are visible. For example:

- work_group scope means that the updates are visible only within the work-group.
- device scope means that the updates are visible only within the device (across work-groups within the device).
- all_svm_devices scope means that the updates are available across devices (GPUs and the host CPU).

6.2.2 Usage

OpenCL 2.0 further differentiates between coarse-grained and fine-grained buffer and system SVM sharing mechanisms. These mechanisms define the granularity at which SVM buffers are shared. Updates to coarse-grained or fine-grained SVM are visible to other devices at synchronization points:

- For coarse-grained SVM, the synchron
context to get the current rendering GL context (HGLRC) of the calling thread.

3. Use wglGetCurrentDC to get the device context (HDC) that is associated with the current OpenGL rendering context of the calling thread.

4. Use the clGetGLContextInfoKHR function (see Section 9.7 of the OpenCL Specification 1.1) with the CL_CURRENT_DEVICE_FOR_GL_CONTEXT_KHR parameter to get the device ID of the CL device associated with the OpenGL context.

5. Use clCreateContext (see Section 4.3 of the OpenCL Specification 1.1) to create the CL context (of type cl_context).

The following code snippet shows how to create an interoperability context using GLUT on a single-GPU system.

    glutInit(&argc, argv);
    glutInitDisplayMode(GLUT_RGBA | GLUT_DOUBLE);
    glutInitWindowSize(WINDOW_WIDTH, WINDOW_HEIGHT);
    glutCreateWindow("OpenCL SimpleGL");

    HGLRC glCtx = wglGetCurrentContext();

    cl_context_properties cpsGL[] = {
        CL_CONTEXT_PLATFORM, (cl_context_properties) platform,
        CL_WGL_HDC_KHR,      (intptr_t) wglGetCurrentDC(),
        CL_GL_CONTEXT_KHR,   (intptr_t) glCtx,
        0
    };

    status = clGetGLContextInfoKHR(cpsGL,
                                   CL_CURRENT_DEVICE_FOR_GL_CONTEXT_KHR,
                                   sizeof(cl_device_id), &interopDevice,
                                   NULL);

    // Create OpenCL context from device's id
    context = clCreateContext(cpsGL, 1, &interopDevice, 0, 0, &status);

Using the Win32 API:

1. Use CreateWindow for window creation and get
Both the X and Y vectors are stored in global memory; X is read-only, Y is read-write.

    kernel void saxpy(const global float *X, global float *Y,
                      const float a)
    {
        uint gid = get_global_id(0);
        Y[gid] = a * X[gid] + Y[gid];
    }

3. List all platforms on the machine, then select one.

    cl::Platform::get(&platforms);

4. Create an OpenCL context on that platform.

    cl_context_properties cps[3] = {
        CL_CONTEXT_PLATFORM, (cl_context_properties)(*iter)(), 0
    };
    context = cl::Context(CL_DEVICE_TYPE_GPU, cps);

5. Get the OpenCL devices from the context.

    devices = context.getInfo<CL_CONTEXT_DEVICES>();

6. Create an OpenCL command queue.

    queue = cl::CommandQueue(context, devices[0]);

7. Create two buffers, corresponding to the X and Y vectors. Ensure the host-side buffers, pX and pY, are allocated and initialized. The CL_MEM_COPY_HOST_PTR flag instructs the runtime to copy over the contents of the host pointer pX in order to initialize the buffer bufX. The bufX buffer uses the CL_MEM_READ_ONLY flag, while bufY requires the CL_MEM_READ_WRITE flag.

    bufX = cl::Buffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                      sizeof(cl_float) * length, pX);

8. Create a program object from the kernel source string, build the program for our devices, and create a kernel object corresponding to the SAXPY kernel. (At this point, it is possible to create multiple kern
the CLK_IMAGE_MEM_FENCE flag and the memory scope memory_scope_work_item is required between reads and writes to the same image to ensure that the writes are visible to subsequent reads. If multiple work-items are writing to and reading from multiple locations in an image, a call to work_group_barrier with the CLK_IMAGE_MEM_FENCE flag is required.

OpenCL 2.0 also allows 2D images to be created from a buffer or from another 2D image, and makes the ability to write to 3D images a core feature. This extends the power of image operations to more situations.

The function clGetSupportedImageFormats returns a list of the image formats supported by the OpenCL platform. An image format has two parameters: channel order and data type. The following lists some image formats OpenCL supports.

Channel orders: CL_A, CL_RG, CL_RGB, CL_RGBA
Channel data types: CL_UNORM_INT8, CL_FLOAT

OpenCL 2.0 provides improved image support, especially support for sRGB images and depth images.

sRGB is a standard RGB color space that is used widely on monitors, printers, digital cameras, and the Internet. Because the linear RGB value is used in most image-processing algorithms, processing the images often requires converting sRGB to linear RGB. OpenCL 2.0 provides a new feature for handling this conversion directly. Note that only the combination of data type CL_UNORM_INT8 and channel order CL_sRGBA is mandatory in Ope
    atomic_init                               Initialize atomic value
    atomic_work_item_fence                    Memory fence
    atomic_store                              Atomic store
    atomic_load                               Atomic load
    atomic_exchange                           Atomic exchange
    atomic_compare_exchange_strong_explicit   Atomic compare-and-exchange (CAS)
    atomic_compare_exchange_weak_explicit     Atomic compare-and-exchange (CAS)
    atomic_fetch_add_explicit                 Atomic fetch-add
    atomic_fetch_sub_explicit                 Atomic fetch-sub
    atomic_fetch_or_explicit                  Atomic fetch-or
    atomic_fetch_xor_explicit                 Atomic fetch-xor
    atomic_fetch_and_explicit                 Atomic fetch-and
    atomic_fetch_max_explicit                 Atomic fetch-max
    atomic_fetch_min_explicit                 Atomic fetch-min
    atomic_flag_test_and_set_explicit         Atomic flag set
    atomic_flag_clear_explicit                Atomic flag clear

Image read and write functions

    read_imagef                               Read from 2D depth (array) image
    write_imagef                              Write to 2D depth (array) image

Work-group functions

    work_group_all                            Test all members of work-group (and-reduction)
    work_group_any                            Test any member of work-group (or-reduction)
    work_group_broadcast                      Broadcast value to every member of work-group
    work_group_reduce_add                     Sum reduction across work-group
    work_group_reduce_max                     Max reduction across work-group
    work_group_reduce_min                     Min reduction across work-group
    work_group_scan_exclusive_add             Sum exclusive scan across work-group
    work_group_scan_exclusive_max             Max exclusive scan across work-group
If a sufficiently large NLOOPS is chosen, effects from kernel launch time and delayed buffer copies to the device by the CL runtime are minimized. Note that, while only a single clFinish is executed at the end of the timing run, the two kernels are always linked using an event to ensure serial execution.

The bandwidth is expressed as "number of input bytes processed." For high-end graphics cards, the bandwidth of this algorithm is about an order of magnitude higher than that of the CPU, due to the parallelized memory subsystem of the graphics card.

7. The results then are checked against the comparison value. This also establishes that the result is the same on both CPU and GPU, which can serve as the first verification test for newly written kernel code.

8. Note the use of the debug buffer to obtain some runtime variables. Debug buffers also can be used to create short execution traces for each thread, assuming the device has enough memory.

9. You can use the Timer.cpp and Timer.h files from the TransferOverlap sample, which is in the SDK samples.

Kernel Code

10. The code uses four-component vectors (uint4) so the compiler can identify concurrent execution paths as often as possible. On the GPU, this can be used to further optimize memory accesses and distribution across ALUs. On the CPU, it can be used to enable SSE-like execution.

11. The kernel sets up a memory access pattern based on the device. For the CPU, the source buffer is chopped
…vector unit to be executed in parallel every cycle.

The number of Asynchronous Compute Engines (ACEs) and CUs in an AMD GPU, and the way they are structured, vary with the device family as well as with the device designations within a family. The ACEs are responsible for managing the CUs and for scheduling and resource allocation of the compute tasks (but not of the graphics shader tasks). The ACEs operate independently; the greater the number of ACEs, the greater the performance. Each ACE fetches commands from cache or memory and creates task queues to be scheduled for execution on the CUs, depending on their priority and on the availability of resources.

Each ACE contains up to eight queues and, together with the graphics command processor, allows up to nine independent vector instructions to be executed per clock cycle. Some of these queues are not available for use by OpenCL.

Devices in the Southern Islands families typically have two ACEs. Devices in the Sea Islands and Volcanic Islands families contain between four and eight ACEs, so they offer more performance. For example, the AMD Radeon R9 290X devices in the VI family contain 8 ACEs and 44 CUs.

2.3 Communication Between Host and the GPU Compute Device

The following subsections discuss the communication between the host (CPU) and the GPU in a compute device. This includes an overview of the PCIe bus, processing API calls, and DMA transfers. Communication and data transfers b
…qualified with OpenCL C address spaces (i.e., global, local, private, and constant). OpenCL C kernels (defined with kernel) can be templated and can be called from within an OpenCL C/C++ program, or as an external entry point from the host.

For kernel templates, the following syntax is used as part of the kernel name (assuming a kernel called foo):

    foo<type, type>

where type, type must be either an OpenCL scalar or vector type, or can be a user-defined type that is allocated in the same source file as the kernel foo. In this case, a kernel is both overloaded and templated:

    foo<type, type>(type, type)

Note that here, overloading resolution is done by first matching non-templated arguments in order of appearance in the definition, then substituting template parameters. This allows intermixing of template and non-template arguments in the signature.

To support template kernels, the same mechanism used for kernel overloading is used. Use the following syntax:

    __attribute__((mangled_name(myMangledName)))

The kernel mangled name is used as a parameter to pass to the clCreateKernel() API. This mechanism is needed to allow template kernels without changing the existing OpenCL kernel creation API. An implementation…
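The need for an explicit mangled name arises because clCreateKernel looks kernels up by a plain string. The same pattern can be illustrated in ordinary C++: a template is instantiated for a concrete type and then exported under a fixed, unmangled name that a string-based lookup could use. The names below (foo, foo_float) are illustrative, not part of any OpenCL API:

```cpp
#include <cstddef>

// Templated "kernel": element-wise add over a buffer.
template <typename T>
void foo(const T* a, const T* b, T* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = a[i] + b[i];
}

// Fixed entry-point name for the float instantiation, analogous to
// __attribute__((mangled_name(myMangledName))) on a template kernel:
// a string-based API can now resolve "foo_float" without knowing the
// compiler's name-mangling scheme.
extern "C" void foo_float(const float* a, const float* b, float* out, std::size_t n) {
    foo<float>(a, b, out, n);
}
```

The design point is the same in both settings: the compiler mangles templated names unpredictably, so the programmer supplies one stable alias per instantiation for the runtime's string-based lookup.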
    clGetGLContextInfoKHR(cpsGL,
                          CL_CURRENT_DEVICE_FOR_GL_CONTEXT_KHR,
                          sizeof(cl_device_id),
                          &interopDeviceId,
                          NULL);

    // Create OpenCL context from device's id
    context = clCreateContext(cpsGL,
                              1,
                              &interopDeviceId,
                              0,
                              0,
                              &status);

Appendix E: OpenCL/OpenGL Interoperability

E.2.2 Multi-GPU Configuration

E.2.2.1 Creating CL Context from a GL Context Using X Window System

1. Use XOpenDisplay to open a connection to the server that controls a display.
2. Use ScreenCount to get the number of available screens.
3. Use XCloseDisplay to close the connection to the X server for the display specified in the Display structure and destroy all windows, resource IDs (Window, Font, Pixmap, Colormap, Cursor, and GContext), or other resources that the client created on this display.
4. Use a FOR loop to enumerate the displays. To change the display, change the value of the environment variable DISPLAY.
5. Inside the loop:
   a. Use putenv to set the environment variable DISPLAY with respect to the display number.
   b. Use XOpenDisplay to open a connection to the server that controls a display.
   c. Use glXChooseFBConfig to get a list of GLX frame buffer configurations that match the specified attributes.
   d. Use glXChooseVisual to get a visual that matches the specified attributes.
   e. Use XCreateColormap to create a colormap
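Steps 4 and 5a above boil down to rewriting the DISPLAY variable once per screen before opening each connection. The following sketch shows just that environment-variable loop (the display-name pattern and helper name are illustrative; the X and GLX calls that would follow are omitted):

```cpp
#include <cstdio>
#include <cstdlib>
#include <string>
#include <vector>

// Build the DISPLAY value for each screen (":0.0", ":0.1", ...),
// set it in the environment, and record what a subsequent
// XOpenDisplay(NULL) would observe via getenv("DISPLAY").
std::vector<std::string> enumerate_displays(int screen_count) {
    std::vector<std::string> seen;
    for (int s = 0; s < screen_count; ++s) {
        char disp[32];
        std::snprintf(disp, sizeof disp, ":0.%d", s);
        setenv("DISPLAY", disp, /*overwrite=*/1);  // step 5a (putenv/setenv)
        seen.push_back(getenv("DISPLAY"));         // what the X calls would use
    }
    return seen;
}
```

In the real sample, each loop iteration would then call XOpenDisplay, glXChooseFBConfig, and the other GLX functions against the display just selected.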
[Figure 3.1: OpenCL Compiler Toolchain — OpenCL source is lowered to LLVM IR, then compiled to CPU or GPU binaries]

For CPU processing, the OpenCL runtime uses the LLVM AS to generate x86 binaries. The OpenCL runtime automatically determines the number of processing elements, or cores, present in the CPU and distributes the OpenCL kernel between them.

For GPU processing, the OpenCL runtime post-processes the AMD IL from the OpenCL compiler and turns it into complete AMD IL. This adds macros (from a macro database, similar to the built-in library) specific to the GPU. The OpenCL Runtime layer then removes unneeded functions and passes the complete IL to the Shader compiler for compilation to GPU-specific binaries.

3.1 Compiling the Program

An OpenCL application consists of a host program (C/C++) and an optional kernel program (.cl). To compile an OpenCL application, the host program must be compiled; this can be done using an off-the-shelf compiler such as g++ or MSVC. The application kernels are compiled into device-specific binaries using the OpenCL compiler. This compiler uses a standard C front end, as well as the low-level virtual machine (LLVM) framework, with extensions for OpenCL. The compiler starts with the OpenCL source that the user program passes through the OpenCL
Appendix B  The OpenCL Installable Client Driver (ICD)

B.1 Overview

The OpenCL Installable Client Driver (ICD) is part of the AMD Accelerated Parallel Processing software stack. Code written prior to SDK v2.0 must be changed to comply with OpenCL ICD requirements.

The ICD allows multiple OpenCL implementations to co-exist; it also allows applications to select between these implementations at runtime. Use the clGetPlatformIDs() and clGetPlatformInfo() functions to see the list of available OpenCL implementations, and select the one that is best for your requirements. It is recommended that developers offer their users a choice on first run of the program, or whenever the list of available platforms changes. A properly implemented ICD and OpenCL library is transparent to the end user.

B.2 Using ICD

Sample code that is part of the SDK contains examples showing how to query the platform API and call the functions that require a valid platform parameter. This is a pre-ICD code snippet:

    context = clCreateContextFromType(
                  0,
                  dType,
                  NULL,
                  NULL,
                  &status);

The ICD-compliant version of this code follows:

    // Have a look at the available platforms and pick either
    // the AMD one if available or a reasonable default.
    cl_uint numPlatforms;
    cl_platform_id platform = NULL;
    status = clGetPlatformIDs(0, NULL, &numPlatforms);
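The selection logic in the ICD-compliant snippet — prefer the AMD platform if present, otherwise fall back to a reasonable default — can be shown independently of the OpenCL API. This sketch operates on plain vendor strings; in real code they would come from clGetPlatformInfo with CL_PLATFORM_VENDOR, and the function name here is illustrative:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Return the index of the preferred platform: the AMD one if its
// vendor string is present, otherwise a reasonable default (index 0).
// Returns -1 if no platforms are available.
int pick_platform(const std::vector<std::string>& vendors) {
    if (vendors.empty()) return -1;
    for (std::size_t i = 0; i < vendors.size(); ++i) {
        if (vendors[i].find("Advanced Micro Devices") != std::string::npos)
            return static_cast<int>(i);
    }
    return 0;  // fall back to the first available implementation
}
```

Offering the user this choice on first run, as recommended above, simply means presenting the `vendors` list instead of deciding silently.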
…sized chunks. The NDRange is the size of the array divided by the chunk size. Each work-item checks whether the key is present in its range and, if the key is present, updates the output array.

The issue with the above approach is that, if the input array is very large, the number of work-items (the NDRange) would be very large; the array is not divided into smaller, more manageable chunks. In OpenCL 2.0, the device enqueue feature offers clear advantages in binary search performance. The kernel is rewritten in OpenCL 2.0 as follows:

    for (key_count = 0; key_count < no_of_keys; key_count++) {
        /* If the element to be found does not lie between them,
           then there is nothing left to do in this thread. */
        if ((elementLower > keys[key_count]) ||
            (elementUpper < keys[key_count])) {
            continue;
        } else {
            /* However, if the element does lie between the lower and
               upper bounds of this thread's search space, we need to
               narrow down the search further in this search space. */
            subdivSize = subdivSize_for_keys[key_count];
            outputArray[key_count].x =
                parent_globalids[key_count] * subdivSize_for_keys[key_count]
                + tid * subdivSize;
            outputArray[key_count].w = 1;
            outputArray[key_count].y = subdivSize_for_keys[key_count];
            globalLowerIndex = tid * subdivSize_for_keys[key_count];
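The narrowing idea behind the device-enqueue kernel can be followed on the host: split the current search range into equal chunks (each chunk standing in for one work-item's search space), keep only the chunk whose lower and upper bounds bracket the key, and repeat until the range is small enough to scan. This standalone sketch uses illustrative names and is not the sample's kernel:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Chunked binary search over a sorted array: repeatedly subdivide the
// active half-open range [lo, hi) into `chunks` pieces, keep the piece
// whose endpoints bracket the key, and finish with a linear scan.
// Returns the key's index, or -1 if it is absent. Assumes chunks >= 2.
long chunked_search(const std::vector<int>& sorted, int key, std::size_t chunks) {
    std::size_t lo = 0, hi = sorted.size();
    while (hi - lo > chunks) {
        std::size_t subdiv = (hi - lo + chunks - 1) / chunks;  // chunk size
        std::size_t next_lo = lo, next_hi = hi;
        for (std::size_t c = 0; c < chunks; ++c) {
            std::size_t b = lo + c * subdiv;
            std::size_t e = std::min(b + subdiv, hi);
            if (b >= e) break;
            // Keep only the chunk whose bounds bracket the key.
            if (sorted[b] <= key && key <= sorted[e - 1]) {
                next_lo = b;
                next_hi = e;
                break;
            }
        }
        if (next_lo == lo && next_hi == hi) return -1;  // no chunk brackets key
        lo = next_lo;
        hi = next_hi;
    }
    for (std::size_t i = lo; i < hi; ++i)
        if (sorted[i] == key) return static_cast<long>(i);
    return -1;
}
```

In the OpenCL 2.0 version, each refinement round is a child kernel enqueued from the device, so the host never has to size an enormous NDRange up front.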
