CUDA Debugger
Contents
1. 5.3 Switching Focus
Chapter 6. Program Execution
6.1 Interrupting the Application
6.2 Single Stepping
Chapter 7. Breakpoints & Watchpoints
7.1 Symbolic Breakpoints
7.2 Line Breakpoints
7.3 Address Breakpoints
7.4 Kernel Entry Breakpoints
7.5 Conditional Breakpoints
7.6 Watchpoints
Chapter 8. Inspecting Program State
8.1 Memory and Variables
8.2 Variable Storage and Accessibility
2. (cuda-gdb) set cuda value_extrapolation on

The debugger will attempt to extrapolate the value of variables beyond their respective live ranges. This setting may report erroneous values.

Appendix A. SUPPORTED PLATFORMS

Host Platform Requirements
CUDA-GDB is supported on all the platforms supported by the CUDA toolkit with which it is shipped. See the CUDA Toolkit release notes for more information.

GPU Requirements
Debugging is supported on all CUDA-capable GPUs with a compute capability of 1.1 or later. Compute capability is a device attribute that a CUDA application can query; for more information, see the latest NVIDIA CUDA Programming Guide on the NVIDIA CUDA Zone web site.

These GPUs have a compute capability of 1.0 and are not supported:
> GeForce 8800 GTS
> GeForce 8800 GTX
> GeForce 8800 Ultra
> Quadro Plex 1000 Model IV
> Quadro Plex 2100 Model S4
> Quadro FX 4600
> Quadro FX 5600
> Tesla C870
> Tesla D870
> Tesla S870

Appendix B. KNOWN ISSUES

The following are known issues with the current release:
> Setting the cuda memcheck option ON will make all the launches blocking.
> Device memory allocated via cudaMalloc() is not visible outside of the kernel function.
> On GPUs with sm_type lower than sm_20, it is not possible to step over a subr
3. *  0  active  0x000000000000008c   (0,0,0)
       1  active  0x000000000000008c   (1,0,0)
       2  active  0x000000000000008c   (2,0,0)
       3  active  0x000000000000008c   (3,0,0)
       4  active  0x000000000000008c   (4,0,0)
       5  active  0x000000000000008c   (5,0,0)
       6  active  0x000000000000008c   (6,0,0)
       7  active  0x000000000000008c   (7,0,0)
       8  active  0x000000000000008c   (8,0,0)
       9  active  0x000000000000008c   (9,0,0)
      10  active  0x000000000000008c   (10,0,0)
      11  active  0x000000000000008c   (11,0,0)
      12  active  0x000000000000008c   (12,0,0)
      13  active  0x000000000000008c   (13,0,0)
      14  active  0x000000000000008c   (14,0,0)
      15  active  0x000000000000008c   (15,0,0)
      16  active  0x000000000000008c   (16,0,0)

8.4.5 info cuda kernels
This command displays all the active kernels on the GPU in focus. It prints the SM mask, kernel ID, and the grid ID for each kernel, with the associated dimensions and arguments. The kernel ID is unique across all GPUs, whereas the grid ID is unique per GPU. The Parent column shows the kernel ID of the parent grid. This command supports filters, and the default is kernel all.

    (cuda-gdb) info cuda kernels
      Kernel Parent Dev Grid Status   SMs Mask   GridDim   BlockDim  Name      Args
    *      1      -   0    2 Active   0x00ffffff (240,1,1) (128,1,1) acos_main parms=...

This command will also show grids that have been launched on the GPU with Dynamic Parallelism. Kernels with a negative grid ID have been launched
4. Set permissions
The first time cuda-gdb is executed, a pop-up dialog window will appear to allow the debugger to take control of another process. The user must have Administrator privileges to allow it. It is a required step.

Another solution used in the past is to add the cuda-binary-gdb to the procmod group and set the taskgated daemon to let such processes take control of other processes. It used to be the solution to fix the "Unable to find Mach task port for processid" error.

    sudo chgrp procmod /Developer/NVIDIA/CUDA-6.0/bin/cuda-binary-gdb
    sudo chmod 2755 /Developer/NVIDIA/CUDA-6.0/bin/cuda-binary-gdb
    sudo chmod 755 /Developer/NVIDIA/CUDA-6.0/bin/cuda-gdb

To set the taskgated daemon to allow the processes in the procmod group to access Task Ports, taskgated must be launched with the -p option. To make it a permanent option, edit /System/Library/LaunchDaemons/com.apple.taskgated.plist. See man taskgated for more information. Here is an example:

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
    <plist version="1.0">
    <dict>
        <key>Label</key>
        <string>com.apple.taskgated</string>
        <key>MachServices</key>
        <dict>
            <key>com.apple.taskgated</key>
            <dict>
                <key>TaskSpecialPort</key>
                <integer>9</integer>
            </dict>
        </dict>
        <key>ProgramArguments</key>
        <arra
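After editing the plist (its remainder is truncated above), the taskgated daemon must be reloaded for the new -p option to take effect. One way to do this, a sketch assuming the standard launchd workflow of that OS X era (a reboot achieves the same result):

    # reload the edited daemon definition
    sudo launchctl unload /System/Library/LaunchDaemons/com.apple.taskgated.plist
    sudo launchctl load /System/Library/LaunchDaemons/com.apple.taskgated.plist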
5. MEMCHECK. This tool is also part of the toolkit. Please read the related documentation for more information.

By default, CUDA-GDB will report any memory error. See GPU Error Reporting for a list of the memory errors. To increase the number of memory errors being reported, and to increase the precision of the memory errors, CUDA memcheck must be turned on.

10.4 Autostep
Description
Autostep is a command to increase the precision of CUDA exceptions to the exact lane and instruction, when they would not have been otherwise. Under normal execution, an exception may be reported several instructions after the exception occurred, or the exact thread where an exception occurred may not be known unless the exception is a lane error. However, the precise origin of the exception can be determined if the program is being single-stepped when the exception occurs. Single-stepping manually is a slow and tedious process: stepping takes much longer than normal execution, and the user has to single-step each warp individually.

Autostep aids the user by allowing them to specify sections of code where they suspect an exception could occur, and these sections are automatically and transparently single-stepped while the program is running. The rest of the program is executed normally to minimize the slow-down caused by single-stepping. The precise origin of an exception will be reported if the exception occurs within these sections. Thus the exact instruction and thre
6. Quadro K5000
      CUDA Driver Version / Runtime Version:        6.0 / 6.0
      CUDA Capability Major/Minor version number:   3.0
      Total amount of global memory:                4096 MBytes (4294508544 bytes)
      ( 8) Multiprocessors x (192) CUDA Cores/MP:   1536 CUDA Cores
    ... truncated output ...
    deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 6.0, CUDA Runtime Version = 6.0,
    NumDevs = 2, Device0 = GeForce GT 330M, Device1 = Quadro K5000

If Device 0 is rendering the desktop, then Device 1 must be selected for running and debugging the CUDA application. This exclusion of a device can be achieved by setting the CUDA_VISIBLE_DEVICES environment variable to the index of the device that will be used for CUDA. In this particular example, the value would be 1:

    export CUDA_VISIBLE_DEVICES=1

As a safeguard mechanism, cuda-gdb will detect if a visible device is also used for display and return an error. To turn off the safeguard mechanism, the set cuda gpu_busy_check option should be set to off:

    (cuda-gdb) set cuda gpu_busy_check off

3.4.6 Remote Debugging
There are multiple methods to remote debug an application with CUDA-GDB. In addition to using SSH or VNC from the host system to connect to the target system, it is also possible to use the target remote GDB feature. Using this option, the local cuda-gdb client connects to the cuda-gdbserver process (the server) running on the target system. This option is supported with a Linux or Mac OS X client an
7. 64-bit server, gdbserver must be 32-bit.

Known Issues (continued)
> Attaching to a CUDA application with Software Preemption enabled in cuda-gdb is not supported.
> Attaching to a CUDA application running in MPS client mode is not supported.
> Attaching to the MPS server process (nvidia-cuda-mps-server) using cuda-gdb, or starting the MPS server with cuda-gdb, is not supported.
> If a CUDA application is started in the MPS client mode with cuda-gdb, the MPS client will wait until all other MPS clients have terminated, and will then run as a non-MPS application.

Notice
ALL NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS, DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY, "MATERIALS") ARE BEING PROVIDED "AS IS." NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE.

Information furnished is believed to be accurate and reliable. However, NVIDIA Corporation assumes no responsibility for the consequences of use of such information, or for any infringement of patents or other rights of third parties that may result from its use. No license is granted by implication or otherwise under any patent rights of NVIDIA Corporation. Specifications mentio
8. Kernel Events
Chapter 10. Automatic Error Checking
10.1 Checking API Errors
10.2 GPU Error Reporting
10.3 set cuda memcheck
10.4 Autostep
Chapter 11. Walk-Through Examples
11.1 Example: bitreverse
11.1.1 Walking through the Code
11.2 Example: autostep
11.2.1 Debugging with Autosteps
11.3 Example: MPI CUDA Application
Chapter 12. Advanced Settings
12.1 --cuda-use-lockfile
12.2 set cuda break_on_launch
12.3 set cuda gpu_busy_check
9. be marked pending until the ELF image of the kernel is loaded. At that point, the breakpoint will be resolved and its address will be updated.

When a breakpoint is set, it forces all resident GPU threads to stop at this location when they reach the corresponding PC.

When a breakpoint is hit by one thread, there is no guarantee that the other threads will hit the breakpoint at the same time. Therefore the same breakpoint may be hit several times, and the user must be careful with checking which thread(s) actually hit(s) the breakpoint.

7.1 Symbolic Breakpoints
To set a breakpoint at the entry of a function, use the break command followed by the name of the function or method:

    (cuda-gdb) break my_function
    (cuda-gdb) break my_class::my_method

For templatized functions and methods, the full signature must be given:

    (cuda-gdb) break int my_templatized_function<int>(int)

The mangled name of the function can also be used. To find the mangled name of a function, you can use the following commands:

    (cuda-gdb) set demangle-style none
    (cuda-gdb) info function my_function_name
    (cuda-gdb) set demangle-style auto

7.2 Line Breakpoints
To set a breakpoint on a specific line number, use the following syntax:

    (cuda-gdb) break my_file.cu:185

If the specified line corresponds to an instruction within templatized code, multiple breakpoints will be crea
10. be resumed past the assertion if needed. Use the set cuda hide_internal_frames option to expose/hide the system call frames (hidden by default).

Temporary Directory
By default, the debugger API will use /tmp as the directory to store temporary files. To select a different directory, the $TMPDIR environment variable and the API CUDBG_APICLIENT_PID variable must be set.

Chapter 3. GETTING STARTED
Included in this chapter are instructions for installing CUDA-GDB and for using NVCC, the NVIDIA CUDA compiler driver, to compile CUDA programs for debugging.

3.1 Installation Instructions
Follow these steps to install CUDA-GDB.
1. Visit the NVIDIA CUDA Zone download page: http://www.nvidia.com/object/cuda_get.html
2. Select the appropriate operating system, Mac OS X or Linux. (See Supported Platforms.)
3. Download and install the CUDA Driver.
4. Download and install the CUDA Toolkit.

3.2 Setting Up the Debugger Environment
3.2.1 Linux
Set up the PATH and LD_LIBRARY_PATH environment variables:

    export PATH=/usr/local/cuda-6.0/bin:$PATH
    export LD_LIBRARY_PATH=/usr/local/cuda-6.0/lib64:/usr/local/cuda-6.0/lib:$LD_LIBRARY_PATH

3.2.2 Mac OS X
Set up environment variables:

    export PATH=/Developer/NVIDIA/CUDA-6.0/bin:$PATH
    export DYLD_LIBRARY_PATH=/Developer/NVIDIA/CUDA-6.0/lib:$DYLD_LIBRARY_PATH
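As a quick sanity check that the environment is set up as intended (the exact path and version string will vary with the installation), verify which binary is picked up:

    $ which cuda-gdb
    /usr/local/cuda-6.0/bin/cuda-gdb
    $ cuda-gdb --version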
11. d, idata, sizeof(int)*N,
    34                cudaMemcpyHostToDevice);
    35     bitreverse<<<1, N, N*sizeof(int)>>>(d);
    36
    37     cudaMemcpy(odata, d, sizeof(int)*N,
    38                cudaMemcpyDeviceToHost);
    39
    40     for (i = 0; i < N; i++)
    41         printf("%u -> %u\n", idata[i], odata[i]);
    42
    43     cudaFree((void*)d);
    44     return 0;
    45 }

11.1.1 Walking through the Code
1. Begin by compiling the bitreverse.cu CUDA application for debugging by entering the following command at a shell prompt:

    nvcc -g -G bitreverse.cu -o bitreverse

This command assumes that the source file name is bitreverse.cu and that no additional compiler flags are required for compilation. See also Debug Compilation.

2. Start the CUDA debugger by entering the following command at a shell prompt:

    cuda-gdb bitreverse

3. Set breakpoints. Set both the host (main) and GPU (bitreverse) breakpoints here. Also, set a breakpoint at a particular line in the device function (bitreverse.cu:18):

    (cuda-gdb) break main
    Breakpoint 1 at 0x18e1: file bitreverse.cu, line 25.
    (cuda-gdb) break bitreverse
    Breakpoint 2 at 0x18a1: file bitreverse.cu, line 8.
    (cuda-gdb) break 21
    Breakpoint 3 at 0x18ac: file bitreverse.cu, line 21.

4. Run the CUDA application, and it executes until it reaches the first breakpoint (main) set in step 3:

    (cuda-gdb) run
    Starting program: /Users/CUDA_User1/docs/bitrever
12. data) {
      int value1, value2, value3, value4, value5;
      int idx1, idx2, idx3;

      idx1 = threadIdx.x;
      idx2 = blockDim.x;
      idx3 = threadIdx.x + blockDim.x;

      value1 = *data[idx1];
      value2 = *data[idx2];

      value3 = value1 + value2;
      value4 = value1 * value2;
      value5 = value3 + value4;

      *data[idx3] = value5;
      *data[idx1] = value3;
      *data[idx2] = value4;

      idx1 = idx2 = idx3 = 0;
    }

    int main(int argc, char *argv[]) {
      int *host_data[NUM_BLOCKS * THREADS_PER_BLOCK];
      int **dev_data;
      const int zero = 0;

      /* Allocate an integer for each thread in each block */
      for (int block = 0; block < NUM_BLOCKS; block++) {
        for (int thread = 0; thread < THREADS_PER_BLOCK; thread++) {
          int idx = thread + block * THREADS_PER_BLOCK;
          cudaMalloc(&host_data[idx], sizeof(int));
          cudaMemcpy(host_data[idx], &zero, sizeof(int),
                     cudaMemcpyHostToDevice);
        }
      }

      /* This inserts an error into block 3, thread 39 */
      host_data[3 * THREADS_PER_BLOCK + 39] = NULL;

      /* Copy the array of pointers to the device */
      cudaMalloc((void**)&dev_data, sizeof(host_data));
      cudaMemcpy(dev_data, host_data, sizeof(host_data), cudaMemcpyHostToDevice);

      /* Execute example */
      example<<<NUM_BLOCKS, THREADS_PER_BLOCK>>>(dev_data);
      cudaThreadSynchronize();
    }

In this small example, we have an array of pointers to integers, a
13. error, we have now narrowed it down to a warp error, so we now know that the thread that threw the exception must have been in the same warp as block 3, thread 32.

In this example, we have narrowed down the scope of the error from 512 threads down to 32 threads, just by setting two autosteps and re-running the program.

11.3 Example: MPI CUDA Application
For doing large MPI CUDA application debugging, NVIDIA recommends using parallel debuggers supplied by our partners Allinea and TotalView. Both make excellent parallel debuggers with extended support for CUDA. However, for debugging smaller applications, or for debugging just a few processes in a large application, CUDA-GDB can easily be used.

If the cluster nodes have xterm support, then it is quite easy to use CUDA-GDB. Just launch CUDA-GDB in the same way you would have launched gdb:

    mpirun -np 4 -host nv1,nv2 xterm -e cuda-gdb a.out

You may have to export the DISPLAY variable to make sure that the xterm finds its way back to your display. For example, with Open MPI you would do something like this:

    mpirun -np 4 -host nv1,nv2 -x DISPLAY=host.nvidia.com:0 xterm -e cuda-gdb a.out

Different MPI implementations have different ways of exporting environment variables to the cluster nodes, so check your documentation.

In the case where you cannot get xterm support, you can insert a spin loop inside your progr
14. from the GPU, while kernels with a positive grid ID have been launched from the CPU. With the cudaDeviceSynchronize routine, it is possible to see grid launches disappear from the device and then resume later, after all child launches have completed.

8.4.6 info cuda blocks
This command displays all the active or running blocks for the kernel in focus. The results are grouped per kernel. This command supports filters, and the default is kernel current block all. The outputs are coalesced by default:

    (cuda-gdb) info cuda blocks
      BlockIdx  To BlockIdx  Count  State
    Kernel 1
    * (0,0,0)   (191,0,0)    192    running

Coalescing can be turned off as follows, in which case more information on the Device and the SM get displayed:

    (cuda-gdb) set cuda coalescing off

The following is the output of the same command when coalescing is turned off:

    (cuda-gdb) info cuda blocks
      BlockIdx  State    Dev  SM
    Kernel 1
    * (0,0,0)   running    0   0
      (1,0,0)   running    0   3
      (2,0,0)   running    0   6
      (3,0,0)   running    0   9
      (4,0,0)   running    0  12
      (5,0,0)   running    0  15
      (6,0,0)   running    0  18
      (7,0,0)   running    0  21
      (8,0,0)   running    0  24

8.4.7 info cuda threads
This command displays the application's active CUDA blocks and threads, with the total count of threads in those blocks. Also displayed are the virtual PC and the associated source file and line number information. The results are grouped per kernel. The command supports filters, with the default being kernel current block all threa
15. vs Hardware Coordinates
A device thread belongs to a block, which in turn belongs to a kernel. Thread, block, and kernel are the software coordinates of the focus. A device thread runs on a lane. A lane belongs to a warp, which belongs to an SM, which in turn belongs to a device. Lane, warp, SM, and device are the hardware coordinates of the focus. Software and hardware coordinates can be used interchangeably and simultaneously, as long as they remain coherent.

Another software coordinate is sometimes used: the grid. The difference between a grid and a kernel is the scope. The grid ID is unique per GPU, whereas the kernel ID is unique across all GPUs. Therefore there is a 1:1 mapping between a kernel and a (grid, device) tuple.

Note: If software preemption is enabled (set cuda software_preemption on), hardware coordinates corresponding to a device thread are likely to change upon resuming execution on the device. However, software coordinates will remain intact and will not change for the lifetime of the device thread.

5.2 Current Focus
To inspect the current focus, use the cuda command followed by the coordinates of interest:

    (cuda-gdb) cuda device sm warp lane block thread
    block (0,0,0), thread (0,0,0), device 0, sm 0, warp 0, lane 0
    (cuda-gdb) cuda kernel block thread
    kernel 1, block (0,0,0), thread (0,0,0)
    (cuda-gdb) cuda kernel
    kernel 1

5.3 Switching Foc
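To switch focus, the same cuda command is given new coordinate values. The exchange below is an illustration only; the coordinates requested and the block/thread reported back depend entirely on the application being debugged:

    (cuda-gdb) cuda device 0 sm 1 warp 2 lane 3
    [Switching focus to CUDA kernel 1, grid 2, block (8,0,0), thread (67,0,0), device 0, sm 1, warp 2, lane 3]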
16. 2 coalesced ranges will be printed: one range for block 0 threads 0 to 7, and one range for block 1 threads 0 to 7. Because threads 8-15 in block 0 are not running, the 2 ranges cannot be coalesced.

The command also supports breakpoint all and breakpoint breakpoint_number as filters. The former displays the threads that hit all CUDA breakpoints set by the user. The latter displays the threads that hit the CUDA breakpoint breakpoint_number:

    (cuda-gdb) info cuda threads breakpoint all
    BlockIdx ThreadIdx         Virtual PC  Dev SM Wp Ln        Filename  Line
    Kernel 0
    (1,0,0)    (0,0,0) 0x0000000000948e58    0 11  0  0 infoCommands.cu    12
    (1,0,0)    (1,0,0) 0x0000000000948e58    0 11  0  1 infoCommands.cu    12
    (1,0,0)    (2,0,0) 0x0000000000948e58    0 11  0  2 infoCommands.cu    12
    (1,0,0)    (3,0,0) 0x0000000000948e58    0 11  0  3 infoCommands.cu    12
    (1,0,0)    (4,0,0) 0x0000000000948e58    0 11  0  4 infoCommands.cu    12
    (1,0,0)    (5,0,0) 0x0000000000948e58    0 11  0  5 infoCommands.cu    12

    (cuda-gdb) info cuda threads breakpoint 2 lane 1
    BlockIdx ThreadIdx         Virtual PC  Dev SM Wp Ln        Filename  Line
    Kernel 0
    (1,0,0)    (1,0,0) 0x0000000000948e58    0 11  0  1 infoCommands.cu    12

8.4.8 info cuda launch trace
This command displays the kernel launch trace for the kernel in focus. The first element in the trace is the kernel in focus. The next element is the kernel that launched this
17. 8.4.3 info cuda warps
This command takes you one level deeper and prints all the warps information for the SM in focus. This command supports filters, and the default is device current sm current warp all. The command can be used to display which warp executes what block:

    (cuda-gdb) info cuda warps
      Wp  Active Lanes Mask  Divergent Lanes Mask    Active Physical PC  Kernel  BlockIdx
    Device 0 SM 0
    *  0         0xffffffff            0x00000000  0x000000000000001c        0   (0,0,0)
       1         0xffffffff            0x00000000  0x0000000000000000        0   (0,0,0)
       2         0xffffffff            0x00000000  0x0000000000000000        0   (0,0,0)
       3         0xffffffff            0x00000000  0x0000000000000000        0   (0,0,0)
       4         0xffffffff            0x00000000  0x0000000000000000        0   (0,0,0)
       5         0xffffffff            0x00000000  0x0000000000000000        0   (0,0,0)
       6         0xffffffff            0x00000000  0x0000000000000000        0   (0,0,0)
       7         0xffffffff            0x00000000  0x0000000000000000        0   (0,0,0)

8.4.4 info cuda lanes
This command displays all the lanes (threads) for the warp in focus. This command supports filters, and the default is device current sm current warp current lane all. In the example below, you can see that all the lanes are at the same physical PC. The command can be used to display which lane executes what thread:

    (cuda-gdb) info cuda lanes
      Ln  State   Physical PC          ThreadIdx
    Device 0 SM 0 Warp 0
18. NVIDIA CUDA-GDB
CUDA DEBUGGER

TABLE OF CONTENTS
Chapter 1. Introduction
1.1 What is CUDA-GDB?
1.2 Supported Features
1.3 About This Document
Chapter 2. Release Notes
Chapter 3. Getting Started
3.1 Installation Instructions
3.2 Setting Up the Debugger Environment
3.2.1 Linux
3.2.2 Mac OS X
3.2.3 Temporary Directory
3.3 Compiling the Application
3.3.1 Debug Compilation
3.3.2 Compiling For Specific GPU architectures
3.4 Using the Debugger
3.4.1 Single-GPU Debugging
19. a launch timeout. In addition, multiple CUDA-GDB sessions can debug CUDA applications context-switching on the same GPU. This feature is available on Linux with SM3.5 devices. For information on enabling this, please see Single-GPU Debugging with the Desktop Manager Running and Multiple Debuggers.

Remote GPU Debugging
CUDA-GDB in conjunction with CUDA-GDBSERVER can now be used to debug a CUDA application running on the remote host.

5.0 Release
Dynamic Parallelism Support
CUDA-GDB fully supports Dynamic Parallelism, a new feature introduced with the 5.0 toolkit. The debugger is able to track the kernels launched from another kernel and to inspect and modify variables like any other CPU-launched kernel.

Attach/Detach
It is now possible to attach to a CUDA application that is already running. It is also possible to detach from the application before letting it run to completion. When attached, all the usual features of the debugger are available to the user, as if the application had been launched from the debugger. This feature is also supported with applications using Dynamic Parallelism.
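For illustration (the application name and the process id 5678 here are hypothetical), attaching and detaching use the standard GDB commands:

    $ cuda-gdb myCudaApplication
    (cuda-gdb) attach 5678
    ...
    (cuda-gdb) detach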
20. ad where an exception occurred can be found quickly and with much less effort by using autostep.

Usage

    autostep [LOCATION]
    autostep [LOCATION] for LENGTH [lines|instructions]

> LOCATION may be anything that you use to specify the location of a breakpoint, such as a line number, function name, or an instruction address preceded by an asterisk. If no LOCATION is specified, then the current instruction address is used.
> LENGTH specifies the size of the autostep window in number of lines or instructions (lines and instructions can be shortened, e.g., l or i). If the length type is not specified, then lines is the default. If the for clause is omitted, then the default is 1 line.
> astep can be used as an alias for the autostep command.
> Calls to functions made during an autostep will be stepped over.
> In case of divergence, the length of the autostep window is determined by the number of lines or instructions the first active lane in each warp executes. Divergent lanes are also single-stepped, but the instructions they execute do not count towards the length of the autostep window.
> If a breakpoint occurs while inside an autostep window, the warp where the breakpoint was hit will not continue autostepping when the program is resumed. However, other warps may continue autostepping.
> Overlapping autosteps are not supported. If an a
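Both usage forms can be exercised as follows (the file name and line number here are hypothetical; $pc is the standard GDB program-counter value):

    (cuda-gdb) autostep example.cu:16 for 5 lines
    (cuda-gdb) autostep *$pc for 20 instructions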
21. am. This works in just the same way as when using gdb on a host-only program. Somewhere near the start of your program, add a code snippet like the following:

    {
        int i = 0;
        char host[256];
        gethostname(host, sizeof(host));  /* fill in the node name */
        printf("PID %d on node %s is ready for attach\n", getpid(), host);
        fflush(stdout);
        while (0 == i)
            sleep(5);
    }

Then recompile and run the program. After it starts, ssh to the nodes of interest and attach to the process. Set the variable i to 1 to break out of the loop:

    $ mpirun -np 2 -host nv1,nv2 a.out
    PID 20060 on node nv1 is ready for attach
    PID 5488 on node nv2 is ready for attach

    [nv1]$ cuda-gdb --pid 20060
    [nv2]$ cuda-gdb --pid 5488

For larger applications, in the case where you may just want to attach to a few of the processes, you can conditionalize the spin loop based on the rank. Most MPIs set an environment variable that is the rank of the process. For Open MPI it is OMPI_COMM_WORLD_RANK, and for MVAPICH it is MV2_COMM_WORLD_RANK. Assuming you want to attach to rank 42, you could add a spin loop like this:

    char *stoprank;
    stoprank = getenv("OMPI_COMM_WORLD_RANK");
    if (42 == atoi(stoprank)) {
        int i = 0;
        char host[256];
        gethostname(host, sizeof(host));
        printf("PID %d on %s ready for attach\n", getpid(), host);
        fflush(stdout);
        while (0 == i)
            sleep(5);
    }

Note that by default CUDA-GDB allows debugging a single process per node. The workaround described in Multiple Debu
22. (cuda-gdb) thread 1
    [Switching to thread 1 (process 16738)]
    #0  0x000019d5 in main () at bitreverse.cu:34
    34        bitreverse<<<1, N, N*sizeof(int)>>>(d);
    (cuda-gdb) backtrace
    #0  0x000019d5 in main () at bitreverse.cu:34
    (cuda-gdb) info cuda kernels
    Kernel Dev Grid   SMs Mask GridDim  BlockDim       Name Args
         0   0    1 0x00000001 (1,1,1) (256,1,1) bitreverse data=0x110000
    (cuda-gdb) cuda kernel 0
    [Switching focus to CUDA kernel 0, grid 1, block (0,0,0), thread (0,0,0), device 0, sm 0, warp 0, lane 0]
    9        unsigned int *idata = (unsigned int*)data;
    (cuda-gdb) backtrace
    #0  bitreverse<<<(1,1,1),(256,1,1)>>> (data=0x110000) at bitreverse.cu:9

7. Corroborate this information by printing the block and thread indexes:

    (cuda-gdb) print blockIdx
    $1 = {x = 0, y = 0}
    (cuda-gdb) print threadIdx
    $2 = {x = 0, y = 0, z = 0}

8. The grid and block dimensions can also be printed:

    (cuda-gdb) print gridDim
    $3 = {x = 1, y = 1}
    (cuda-gdb) print blockDim
    $4 = {x = 256, y = 1, z = 1}

9. Advance kernel execution and verify some data:

    (cuda-gdb) next
    12        array[threadIdx.x] = idata[threadIdx.x];
    (cuda-gdb) next
    14        array[threadIdx.x] = ((0xf0f0f0f0 & array[threadIdx.x]) >> 4) |
    (cuda-gdb) next
    16        array[threadIdx.x] = ((0xcccccccc & array[threadIdx.x]) >> 2) |
    (cuda-gdb) next
    18        array[threadIdx.x] = ((0xaaaaaaaa & array[threadIdx.x]) >> 1) |
    (cuda-gdb) next

    Breakpoint 3, bitreverse<<<(1,1,1),(256,1,1)>>>
23. bles can be stored either in registers or in local, shared, const, or global memory. You can print the address of any variable to find out where it is stored and directly access the associated memory.

The example below shows how the variable array, which is of type @shared int*, can be directly accessed in order to see what the stored values are in the array:

    (cuda-gdb) print &array
    $1 = (@shared int (*)[0]) 0x20
    (cuda-gdb) print array[0]@4
    $2 = {0, 128, 64, 192}

You can also access the shared memory indexed into the starting offset to see what the stored values are:

    (cuda-gdb) print *(@shared int*)0x20
    $3 = 0
    (cuda-gdb) print *(@shared int*)0x24
    $4 = 128
    (cuda-gdb) print *(@shared int*)0x28
    $5 = 64

The example below shows how to access the starting address of the input parameter to the kernel:

    (cuda-gdb) print &data
    $6 = (const @global void * const @parameter *) 0x10
    (cuda-gdb) print *(@global void * const @parameter *) 0x10
    $7 = (@global void * const @parameter) 0x110000

8.3 Inspecting Textures
The debugger can always read/write the source variables when the PC is on the first assembly instruction of a source instruction. When doing assembly-level debugging, the value of source variables is not always accessible.

To inspect a texture, use the print command while de-referencing the texture recast to the type of the array
24. d all. The outputs are coalesced by default, as follows:

    (cuda-gdb) info cuda threads
      BlockIdx ThreadIdx To BlockIdx ThreadIdx Count         Virtual PC  Filename  Line
    Device 0 SM 0
    * (0,0,0)   (0,0,0)    (0,0,0)  (31,0,0)    32 0x000000000088f88c   acos.cu   376
      (0,0,0)  (32,0,0)  (191,0,0) (127,0,0) 24544 0x000000000088f800   acos.cu   374

Coalescing can be turned off as follows, in which case more information is displayed with the output:

    (cuda-gdb) info cuda threads
     BlockIdx ThreadIdx         Virtual PC Dev SM Wp Ln  Filename  Line
    Kernel 1
    * (0,0,0)   (0,0,0) 0x000000000088f88c   0  0  0  0   acos.cu   376
      (0,0,0)   (1,0,0) 0x000000000088f88c   0  0  0  1   acos.cu   376
      (0,0,0)   (2,0,0) 0x000000000088f88c   0  0  0  2   acos.cu   376
      (0,0,0)   (3,0,0) 0x000000000088f88c   0  0  0  3   acos.cu   376
      (0,0,0)   (4,0,0) 0x000000000088f88c   0  0  0  4   acos.cu   376
      (0,0,0)   (5,0,0) 0x000000000088f88c   0  0  0  5   acos.cu   376
      (0,0,0)   (6,0,0) 0x000000000088f88c   0  0  0  6   acos.cu   376
      (0,0,0)   (7,0,0) 0x000000000088f88c   0  0  0  7   acos.cu   376
      (0,0,0)   (8,0,0) 0x000000000088f88c   0  0  0  8   acos.cu   376
      (0,0,0)   (9,0,0) 0x000000000088f88c   0  0  0  9   acos.cu   376

In coalesced form, threads must be contiguous in order to be coalesced. If some threads are not currently running on the hardware, they will create holes in the thread ranges. For instance, if a kernel consists of 2 blocks of 16 threads, and only the 8 lowest threads are active, then
25. d a Linux server. It is not possible to remotely debug a CUDA application running on Mac OS X.

Setting up remote debugging that way is a 2-step process.

Launch the cuda-gdbserver on the remote host
cuda-gdbserver can be launched on the remote host in different operation modes.

> Option 1: Launch a new application in debug mode.
To launch a new application in debug mode, invoke cuda-gdbserver as follows:

    $ cuda-gdbserver :1234 app_invocation

Where 1234 is the TCP port number that cuda-gdbserver will listen to for incoming connections from cuda-gdb, and app_invocation is the invocation command to launch the application, arguments included.

> Option 2: Attach cuda-gdbserver to the running process.
To attach cuda-gdbserver to an already running process, the --attach option followed by the process identification number (PID) must be used:

    $ cuda-gdbserver :1234 --attach 5678

Where 1234 is the TCP port number and 5678 is the process identifier of the application cuda-gdbserver must be attached to.

When debugging a 32-bit application on a 64-bit server, cuda-gdbserver must also be 32-bit.

Launch cuda-gdb on the client
Configure cuda-gdb to connect to the remote target using either target remote or target extended-remote. It is recommended to use the set sysroot command if libraries installed on the debug target might differ from the ones installed
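Putting the two steps together, a typical client-side session might look like the following sketch (the target address 192.168.0.2:1234 is illustrative; set sysroot is only needed when target and client libraries differ, as noted above):

    (cuda-gdb) set sysroot remote:
    (cuda-gdb) target remote 192.168.0.2:1234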
26. (data=0x100000) at bitreverse.cu:21
    21        idata[threadIdx.x] = array[threadIdx.x];
    (cuda-gdb) print array[0]@12
    $7 = {0, 128, 64, 192, 32, 160, 96, 224, 16, 144, 80, 208}
    (cuda-gdb) print/x array[0]@12
    $8 = {0x0, 0x80, 0x40, 0xc0, 0x20, 0xa0, 0x60, 0xe0, 0x10, 0x90, 0x50, 0xd0}
    (cuda-gdb) print &data
    $9 = (@global void * @parameter *) 0x10
    (cuda-gdb) print *(@global void * @parameter *) 0x10
    $10 = (@global void * @parameter) 0x100000

The resulting output depends on the current content of the memory location.

10. Since thread (0,0,0) reverses the value of 0, switch to a different thread to show more interesting data:

    (cuda-gdb) cuda thread 170
    [Switching focus to CUDA kernel 0, grid 1, block (0,0,0), thread (170,0,0), device 0, sm 0, warp 5, lane 10]

11. Delete the breakpoints and continue the program to completion:

    (cuda-gdb) delete breakpoints
    Delete all breakpoints? (y or n) y
    (cuda-gdb) continue
    Continuing.
    Program exited normally.
    (cuda-gdb)

11.2 Example: autostep
This section shows how to use the autostep command, and demonstrates how it helps increase the precision of memory error reporting.
27. debugger will return an error if at least one visible device is already in use for display. It is the default setting.

12.4 set cuda launch_blocking
When enabled, the kernel launches are synchronous, as if the environment variable CUDA_LAUNCH_BLOCKING had been set to 1. Once blocking, the launches are effectively serialized and may be easier to debug.

    (cuda-gdb) set cuda launch_blocking off

The kernel launches are launched synchronously or asynchronously, as dictated by the application. This is the default.

    (cuda-gdb) set cuda launch_blocking on

The kernel launches are synchronous. If the application has already started, the change will only take effect after the current session has terminated.

12.5 set cuda notify
Any time a CUDA event occurs, the debugger needs to be notified. The notification takes place in the form of a signal being sent to a host thread. The host thread to receive that special signal is determined with the set cuda notify option.

    (cuda-gdb) set cuda notify youngest

The host thread with the smallest thread id will receive the notification signal (default).

    (cuda-gdb) set cuda notify random

An arbitrary host thread will receive the notification signal.

12.6 set cuda ptx_cache
Before accessing the value of a variable, the debugger checks whether the variable is live or not at the current PC. On CUDA devices, the variab
28. dia.com/cuda-gpus. The same application can be compiled for multiple GPU architectures. Use the -gencode compilation option to dictate which GPU architecture to compile for. The option can be specified multiple times.

For instance, to compile an application for a GPU with compute capability 3.0, add the following flag to the compilation command:

    -gencode arch=compute_30,code=sm_30

To compile PTX code for any future architecture past the compute capability 3.5, add the following flag to the compilation command:

    -gencode arch=compute_35,code=compute_35

For additional information, please consult the compiler documentation at http://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#extended-notation
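Putting this together with the -g -G debug flags described in Debug Compilation, a debug build that targets both architectures might look like this (the file name myapp.cu is hypothetical):

    nvcc -g -G \
         -gencode arch=compute_30,code=sm_30 \
         -gencode arch=compute_35,code=compute_35 \
         myapp.cu -o myapp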
29. e as the user name in the desktop UI login screen. To enable the console login option, open the System Preferences > Users & Groups > Login Options tab, set the automatic login option to Off, and set "Display login window as" to "Name and password".

To launch/debug cuda applications in console mode on systems with an integrated GPU and a discrete GPU, also make sure that the Automatic Graphics Switching option in the System Settings > Energy Saver tab is unchecked.

3.4.2 Single-GPU Debugging with the Desktop Manager Running
CUDA-GDB can be used to debug CUDA applications on the same GPU that is running the desktop GUI. This is a BETA feature available on Linux and supports devices with SM3.5 compute capability.

There are two ways to enable this functionality:
> Use the following command:

    set cuda software_preemption on

> Export the following environment variable:

    CUDA_DEBUGGER_SOFTWARE_PREEMPTION=1

Either of the options above will activate software preemption. These options must be set prior to running the application. When the GPU hits a breakpoint or any other event that would normally cause the GPU to freeze, CUDA-GDB releases the GPU for use by the desktop or other applications. This enables CUDA-GDB to debug a CUDA application on the same GPU that is running the desktop GUI, and also enables debugging of multiple CUDA applications context-switching
30. With Dynamic Parallelism on sm_35, several CUDA APIs can now be instantiated from the device. The following list defines single-step behavior when encountering these APIs:
> When encountering device-side kernel launches (denoted by the <<<>>> launch syntax), the step and next commands will have the same behavior, and both will step over the launch call.
> When encountering cudaDeviceSynchronize, the launch synchronization routine, the step and next commands will have the same behavior, and both will step over the call. When stepping over the call, the entire device is resumed until the call has completed, at which point the device is suspended (without user intervention).
> When stepping a device grid launch to completion, focus will automatically switch back to the CPU. The cuda kernel focus switching command must be used to switch to another grid of interest (if one is still resident).

Note: It is not possible to step into a device launch call (nor the routine launched by the call).
31. Chapter 11. WALK-THROUGH EXAMPLES
The chapter contains the following CUDA-GDB walk-through examples:
> Example: bitreverse
> Example: autostep
> Example: MPI CUDA Application

11.1 Example: bitreverse
This section presents a walk-through of CUDA-GDB by debugging a sample application, called bitreverse, that performs a simple 8-bit reversal on a data set.
32. 3.4.2 Single-GPU Debugging with the Desktop Manager Running
3.4.3 Multi-GPU Debugging
3.4.4 Multi-GPU Debugging in Console Mode
3.4.5 Multi-GPU Debugging with the Desktop Manager Running
3.4.6 Remote Debugging
3.4.7 Multiple Debuggers
3.4.8 Attaching/Detaching
3.4.9 CUDA/OpenGL Interop Applications on Linux
Chapter 4. CUDA-GDB Extensions
4.1 Command Naming Convention
4.2 Getting Help
4.3 Initialization File
4.4 GUI Integration
Chapter 5. Kernel Focus
5.1 Software Coordinates vs. Hardware Coordinates
5.2 Current Focus
33. 12.4 set cuda launch_blocking
12.5 set cuda notify
12.6 set cuda ptx_cache
12.7 set cuda single_stepping_optimizations
12.8 set cuda thread_selection
12.9 set cuda value_extrapolation
Appendix A. Supported Platforms
Appendix B. Known Issues

LIST OF TABLES
Table 1. CUDA Exception Codes

Chapter 1. INTRODUCTION
This document introduces CUDA-GDB, the NVIDIA CUDA debugger for Linux and Mac OS.

1.1 What is CUDA-GDB?
CUDA-GDB is the NVIDIA tool for debugging CUDA applications running on Linux and Mac. CUDA-GDB is an extension to the x86-64 port of GDB, the GNU Project debugger. The tool provides developers with a mechanism fo
34. 8.3 Inspecting Textures
8.4 Info CUDA Commands
8.4.1 info cuda devices
8.4.2 info cuda sms
8.4.3 info cuda warps
8.4.4 info cuda lanes
8.4.5 info cuda kernels
8.4.6 info cuda blocks
8.4.7 info cuda threads
8.4.8 info cuda launch trace
8.4.9 info cuda launch children
8.4.10 info cuda contexts
8.4.11 info cuda managed
8.5 Disassembly
Chapter 9. Event Notifications
9.1 Context Events
9.2
35. error code of all the CUDA driver API and CUDA runtime API function calls is vital to ensure the correctness of a CUDA application. Now the debugger is able to report, and even stop, when any API call returns an error. See set cuda api_failures for more information.

Inlined Subroutine Support
Inlined subroutines are now accessible from the debugger on SM 2.0 and above. The user can inspect the local variables of those subroutines and visit the call frame stack as if the routines were not inlined.

4.2 Release
Kepler Support
The primary change in Release 4.2 of CUDA-GDB is the addition of support for the new Kepler architecture. There are no other user-visible changes in this release.

4.1 Release
Source Base Upgraded to GDB 7.2
Until now, CUDA-GDB was based on GDB 6.6 on Linux, and GDB 6.3.5 on Darwin (the Apple branch). Now, both versions of CUDA-GDB are using the same 7.2 source base. Now CUDA-GDB supports newer versions of GCC (tested up to GCC 4.5), has better support for DWARF3 debug information, and better C++ debugging support.

Simultaneous Sessions Support
With the 4.1 release, the single CUDA-GDB process restriction is lifted. Now, multiple CUDA-GDB sessions are allowed to co-exist, as long as the GPUs are not shared between the applications being processed. For instance, one CUDA-GDB process can debug process foo using GPU 0 while another CUDA-GDB process debugs process bar using GPU 1. The exclusive use of GPUs can be enforced wi
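A sketch of how the option is typically used is shown below; the mode names should be checked against the set cuda api_failures documentation for the release in use:

    (cuda-gdb) set cuda api_failures stop    # stop in the debugger when an API call returns an error
    (cuda-gdb) set cuda api_failures ignore  # report the error but keep running
    (cuda-gdb) set cuda api_failures hide    # neither report nor stop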
36. ers about kernel events and context events. Within CUDA-GDB, kernel refers to the device code that executes on the GPU, while context refers to the virtual address space on the GPU for the kernel. You can enable output of CUDA context and kernel events to review the flow of the active contexts and kernels. By default, only context event messages are displayed.

9.1 Context Events
Any time a CUDA context is created, pushed, popped, or destroyed by the application, CUDA-GDB will display a notification message. The message includes the context id and the device id to which the context belongs:

    [Context Create of context 0xad2fe60 on Device 0]
    [Context Destroy of context 0xad2fe60 on Device 0]

The context event notification policy is controlled with the context_events option:

    (cuda-gdb) set cuda context_events off

CUDA-GDB does not display the context event notification messages.

    (cuda-gdb) set cuda context_events on

CUDA-GDB displays the context event notification messages (default).

9.2 Kernel Events
Any time CUDA-GDB is made aware of the launch or the termination of a CUDA kernel, a notification message can be displayed. The message includes the kernel id, the kernel name, and the device to which the kernel belongs:

    [Launch of CUDA Kernel 1 (kernel3) on Device 0]
    [Termination of CUDA Kernel 1 (kernel3) on Device 0]
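By analogy with the context_events option shown above, kernel event notifications are toggled with a corresponding set command. This is shown as an assumption based on the option-naming pattern; consult the release documentation for the exact accepted values:

    (cuda-gdb) set cuda kernel_events on
    (cuda-gdb) set cuda kernel_events off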
37. ggers does not work with MPI applications. If CUDA_VISIBLE_DEVICES is set, it may cause problems with the GPU selection logic in the MPI application. It may also prevent CUDA IPC working between GPUs on a node. In order to start multiple CUDA-GDB sessions to debug individual MPI processes on the same node, use the --cuda-use-lockfile=0 option when starting CUDA-GDB, as described in --cuda-use-lockfile. Each MPI process must guarantee it targets a unique GPU for this to work properly.

Chapter 12. ADVANCED SETTINGS

12.1 --cuda-use-lockfile
When debugging an application, CUDA-GDB will suspend all the visible CUDA-capable devices. To avoid any resource conflict, only one CUDA-GDB session is allowed at a time. To enforce this restriction, CUDA-GDB uses a locking mechanism, implemented with a lock file. That lock file prevents 2 CUDA-GDB processes from running simultaneously.

However, if the user desires to debug two applications simultaneously through two separate CUDA-GDB sessions, the following solutions exist:
> Use the CUDA_VISIBLE_DEVICES environment variable to target unique GPUs for each CUDA-GDB session. This is described in more detail in Multiple Debuggers.
> Lift the lockfile restriction by using the --cuda-use-lockfile command-line option:

    cuda-gdb --cuda-use-lockfile=0 my_app

This option is the recommended solution when debugging multiple ranks of an MPI applicatio
38. he (cuda-gdb) prompt will appear. At that point, the program can be inspected, modified, single-stepped, resumed, or terminated at the user's discretion. This feature is limited to applications running within the debugger. It is not possible to break into and debug applications that have been launched outside the debugger.

6.2 Single Stepping
Single stepping device code is supported. However, unlike host code single stepping, device code single stepping works at the warp level. This means that single stepping a device kernel advances all the active threads in the warp currently in focus. The divergent threads in the warp are not single-stepped.

In order to advance the execution of more than one warp, a breakpoint must be set at the desired location, and then the application must be fully resumed.

A special case is single stepping over a thread barrier call: __syncthreads(). In this case, an implicit temporary breakpoint is set immediately after the barrier, and all threads are resumed until the temporary breakpoint is hit.

On GPUs with sm_type lower than sm_20, it is not possible to step over a subroutine in the device code. Instead, CUDA-GDB always steps into the device function. On GPUs with sm_type sm_20 and higher, you can step in, over, or out of the device functions, as long as they are not inlined. To force a function to not be inlined by the compiler, the __noinline__ keyword must be added to the function declaration.
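For instance, a minimal sketch (this helper function is hypothetical) of a device function declared so that it remains a separate, steppable routine:

    __device__ __noinline__ int scale(int v)
    {
        // Because of __noinline__, cuda-gdb can step into, over, or out of
        // this call on sm_20 and higher instead of seeing it inlined away.
        return 2 * v;
    }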
39. Table 1. CUDA Exception Codes (continued)

CUDA_EXCEPTION_6 "Warp Misaligned Address"
    Precision: Not precise. Scope: Warp error.
    This occurs when any thread within a warp accesses an address in the local or shared memory segments that is not correctly aligned.

CUDA_EXCEPTION_7 "Warp Invalid Address Space"
    Precision: Not precise. Scope: Warp error.
    This occurs when any thread within a warp executes an instruction that accesses a memory space not permitted for that instruction.

CUDA_EXCEPTION_8 "Warp Invalid PC"
    Precision: Not precise. Scope: Warp error.
    This occurs when any thread within a warp advances its PC beyond the 40-bit address space.

CUDA_EXCEPTION_9 "Warp Hardware Stack Overflow"
    Precision: Not precise. Scope: Warp error.
    This occurs when any thread in a warp triggers a hardware stack overflow. This should be a rare occurrence.

CUDA_EXCEPTION_10 "Device Illegal Address"
    Precision: Not precise. Scope: Global error.
    This occurs when a thread accesses an illegal (out-of-bounds) global address. For increased precision, use the set cuda memcheck option.

CUDA_EXCEPTION_11 "Lane Misaligned Address"
    Precision: Precise (requires memcheck on). Scope: Per-lane/thread error.
    This occurs when a thread accesses a global address that is not correctly aligned.

CUDA_EXCEPTION_12 "Warp Assert"
    Precision: Precise. Scope: Per-warp error.
    This occurs when any thread in the warp hits a device-side assertion.

CUDA_EXCEPTION_13 "Lane Syscall Error"
    Precision: Precise. Scope: Per-lane/thread error.
    This occurs when a thread corrupts the heap by invoking free with an invalid address (for example, trying to free the same memory region twice).

CUDA_EXCEPTION_14 "Warp Illegal Address"
    Precision: Precise (requires memcheck on). Scope: Per-warp error.
    This occurs when a thread accesses an illegal (out-of-bounds) global/local/shared address. For increased precision, use the set cuda memcheck option.

CUDA_EXCEPTION_15 "Invalid Managed Memory Access"
    Precision: Precise. Scope: Per-host-thread error.
    This occurs when a host thread attempts to access managed memory currently used by the GPU.

40. 10.3 set cuda memcheck
The CUDA memcheck feature detects global memory violations and mis-aligned global memory accesses. This feature is off by default and can be enabled using the following variable in CUDA-GDB before the application is run:

    (cuda-gdb) set cuda memcheck on
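An illustrative session showing the effect of enabling the memory checker before running the application (the program and the exact exception reported will vary; CUDA_EXCEPTION_1 is one of the precise, per-lane errors that memcheck makes possible):

    (cuda-gdb) set cuda memcheck on
    (cuda-gdb) run
    ...
    Program received signal CUDA_EXCEPTION_1, Lane Illegal Address.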
41. it is bound to. For instance, if texture tex is bound to array A of type float*, use:

    (cuda-gdb) print *(@texture float *)tex

All the array operators, such as [], can be applied to (@texture float *)tex:

    (cuda-gdb) print ((@texture float *)tex)[2]
    (cuda-gdb) print ((@texture float *)tex)[2]@4

8.4 Info CUDA Commands
These are commands that display information about the GPU and the application's CUDA state. The available options are:
> devices: information about all the devices
> sms: information about all the SMs in the current device
> warps: information about all the warps in the current SM
> lanes: information about all the lanes in the current warp
> kernels: information about all the active kernels
> blocks: information about all the active blocks in the current kernel
> threads: information about all the active threads in the current kernel
> launch trace: information about the parent kernels of the kernel in focus
> launch children: information about the kernels launched by the kernels in focus
> contexts: information about all the contexts
42. l. In that case, the kernel is CPU-launched. For each kernel in the trace, the command prints the level of the kernel in the trace, the kernel ID, the device ID, the grid ID, the status, the kernel dimensions, the kernel name, and the kernel arguments:

    (cuda-gdb) info cuda launch trace
      Lvl Kernel Dev Grid     Status   GridDim  BlockDim Invocation
    *   0      3   0   -7     Active  (32,1,1)  (16,1,1) kernel3(c=5)
        1      2   0   -5 Terminated (240,1,1) (128,1,1) kernel2(b=3)
        2      1   0    2     Active (240,1,1) (128,1,1) kernel1(a=1)

A kernel that has been launched but that is not running on the GPU will have a Pending status. A kernel currently running on the GPU will be marked as Active. A kernel waiting to become active again will be displayed as Sleeping. When a kernel has terminated, it is marked as Terminated. For the few cases when the debugger cannot determine if a kernel is pending or terminated, the status is set to Undetermined.

This command supports filters, and the default is kernel all.

Note: With set cuda software_preemption on, no kernel will be reported as active.

8.4.9 info cuda launch children
This command displays the list of non-terminated kernels launched by the kernel in focus. For each kernel, the kernel ID, the device ID, the grid ID, the kernel dimensions, the kernel name, and the kernel parameters are displayed:

    (cuda-gdb) info cuda launch children
      Kernel Dev Grid  GridDim BlockDim Invocation
    *      3   0   -7  (1,1,1)  (1,1,1) kernel5(a=3)
          18   0   -8  (1,1,1) (32,1,1) ker
On CUDA devices, variables may not be live all the time and will be reported as "Optimized Out". CUDA-GDB offers an option to circumvent this limitation by caching the value of the variable at the PTX register level. Each source variable is compiled into a PTX register, which is later mapped to one or more hardware registers. Using the debug information emitted by the compiler, the debugger may be able to cache the value of a PTX register based on the latest hardware register it was mapped to at an earlier time.

This optimization is always correct. When enabled, the cached value will be displayed as the normal value read from an actual hardware register, and indicated with the (cached) prefix. The optimization will only kick in while single-stepping the code.

(cuda-gdb) set cuda ptx_cache off
The debugger only reads the value of live variables.

(cuda-gdb) set cuda ptx_cache on
The debugger will use the cached value when possible. This setting is the default and is always safe.

12.7. set cuda single_stepping_optimizations

Single-stepping can take a lot of time. When enabled, this option tells the debugger to use safe tricks to accelerate single-stepping.

(cuda-gdb) set cuda single_stepping_optimizations off
The debugger will not try to accelerate single-stepping. This is the unique and default behavior in the 5.5 release and earlier.
such as when debugging an MPI application that uses separate GPUs for each rank. It is also required when using software preemption (set cuda software_preemption on) to debug multiple CUDA applications context-switching on the same GPU.

12.2. set cuda break_on_launch

To break on the first instruction of every launched kernel, set the break_on_launch option to application:

(cuda-gdb) set cuda break_on_launch application

Possible options are:

> none: no kernel, application, or system (default)
> application: kernel launched by the user application
> system: any kernel launched by the driver, such as memset
> all: any kernel, application, and system

Those automatic breakpoints are not displayed by the info breakpoints command and are managed separately from individual breakpoints. Turning off the option will not delete other individual breakpoints set to the same address, and vice versa.

12.3. set cuda gpu_busy_check

As a safeguard mechanism, cuda-gdb will detect if a visible device is also used for display and return an error. A device used for display cannot be used for compute while debugging. To hide the device, use the CUDA_VISIBLE_DEVICES environment variable. This option is only valid on Mac OS X.

(cuda-gdb) set cuda gpu_busy_check off
The safeguard mechanism is turned off, and the user is responsible for guaranteeing the device can safely be used.

(cuda-gdb) set cuda gpu_busy_check on
The safeguard mechanism is turned on. This is the default setting.
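For example (the device index is a placeholder), the display device can be hidden before starting the debugger so that only the compute device is visible:

$ CUDA_VISIBLE_DEVICES=1 cuda-gdb my_app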
In this example, the kernel operates on an array of pointers to integers, and we want to do some operations on the integers. Suppose, however, that one of the pointers is NULL, as shown in line 38. This will cause CUDA_EXCEPTION_10 "Device Illegal Address" to be thrown when we try to access the integer that corresponds with block 3, thread 39. This exception should occur at line 16, when we try to write to that value.

11.2.1. Debugging with Autosteps

1. Compile the example and start CUDA-GDB as normal. We begin by running the program:

(cuda-gdb) run
Starting program: /home/jitud/cudagdb_test/autostep_ex/example
[Thread debugging using libthread_db enabled]
[New Thread 0x7ffff5688700 (LWP 9083)]
[Context Create of context 0x617270 on Device 0]
[Launch of CUDA Kernel 0 (example<<<(8,1,1),(64,1,1)>>>) on Device 0]

Program received signal CUDA_EXCEPTION_10, Device Illegal Address.
[Switching focus to CUDA kernel 0, grid 1, block (1,0,0), thread (0,0,0), device 0, sm 1, warp 0, lane 0]
0x0000000000796f60 in example (data=0x200300000) at example.cu:17
17          *(data[idx1]) = value3;

As expected, we received a CUDA_EXCEPTION_10. However, the reported thread is block 1, thread 0, and the line is 17. Since CUDA_EXCEPTION_10 is a global error, there is no thread information that is reported, so we would manually have to inspect all 512 threads.
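For reference, the source of the sample is not reproduced in this manual; a hypothetical reconstruction that matches the behavior described above (8 blocks of 64 threads writing through an array of pointers, one of which is NULL) could look like this:

// Hypothetical reconstruction of example.cu, for illustration only.
__global__ void example(int **data)
{
    int idx1 = blockIdx.x * blockDim.x + threadIdx.x;   // 0..511
    int value3 = idx1;        // placeholder computation
    *(data[idx1]) = value3;   // faults with CUDA_EXCEPTION_10 when
                              // data[idx1] is NULL (block 3, thread 39)
}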
Specifications mentioned in this publication are subject to change without notice. This publication supersedes and replaces all other information previously supplied. NVIDIA Corporation products are not authorized as critical components in life support devices or systems without express written approval of NVIDIA Corporation.

Trademarks

NVIDIA and the NVIDIA logo are trademarks or registered trademarks of NVIDIA Corporation in the U.S. and other countries. Other company and product names may be trademarks of the respective companies with which they are associated.

Copyright

© 2007-2014 NVIDIA Corporation. All rights reserved.
This command supports filters, and the default is kernel all.

8.4.10. info cuda contexts

This command enumerates all the CUDA contexts running on all GPUs. A * indicates the context currently in focus. This command shows whether a context is currently active on a device or not.

(cuda-gdb) info cuda contexts
     Context Dev    State
  0x080b9518   0 inactive
* 0x08067948   0   active

8.4.11. info cuda managed

This command shows all the static managed variables, on the device or on the host depending on the focus.

(cuda-gdb) info cuda managed
Static managed variables on device 0 are:
managed_var3 = 3
managed_consts = {e = 2.71000004, pi = 3.1400000000000001}

8.5. Disassembly

The device SASS code can be disassembled using the standard GDB disassembly instructions, such as x/i and display/i.

(cuda-gdb) x/4i $pc
=> 0x7a5cf0 <_Z9foo10Params+752>: IMUL R2, R0, R3
   0x7a5cf8 <_Z9foo10Params+760>: MOV R3, R4
   0x7a5d00 <_Z9foo10Params+768>: IMUL R0, R0, R3
   0x7a5d08 <_Z9foo10Params+776>: IADD R18, R0, R3

For the disassembly instruction to work properly, cuobjdump must be installed and present in your $PATH.
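As additional context for the info cuda managed output shown above, such listings correspond to statically allocated managed variables in the application source, declared along these hypothetical lines:

// Hypothetical declarations mirroring the sample output above.
__device__ __managed__ int managed_var3 = 3;

struct Consts { float e; double pi; };
__device__ __managed__ Consts managed_consts = { 2.71f, 3.14 };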
This is referred to as just-in-time compilation, or JIT compilation for short.

1.3. About This Document

This document is the main documentation for CUDA-GDB and is organized more as a user manual than a reference manual. The rest of the document will describe how to install and use CUDA-GDB to debug CUDA kernels, and how to use the new CUDA commands that have been added to GDB. Some walk-through examples are also provided. It is assumed that the user already knows the basic GDB commands used to debug host applications.

Chapter 2. Release Notes

6.0 Release

Unified Memory Support
Managed variables can be read and written from either a host thread or a device thread. The debugger also annotates memory addresses that reside in managed memory with @managed. The list of statically allocated managed variables can be accessed through a new info cuda managed command.

GDB 7.6 Code Base
The code base for CUDA-GDB was upgraded from GDB 7.2 to GDB 7.6.

Android Support
CUDA-GDB can now be used to debug Android applications, either locally or remotely.

Single-Stepping Optimizations
CUDA-GDB can now use optimized methods to single-step the program, which accelerate single-stepping most of the time. This feature can be disabled by issuing set cuda single_stepping_optimizations off.

Faster Remote Debugging
A lot of effort has gone into making remote debugging considerably faster, up to 2 orders of magnitude.
3.4.8. Attaching/Detaching

CUDA-GDB can attach to and detach from a CUDA application running on GPUs with compute capability 2.0 and beyond, using GDB's built-in commands for attaching to or detaching from a process.

Additionally, if the environment variable CUDA_DEVICE_WAITS_ON_EXCEPTION is set to 1 prior to running the CUDA application, the application will run normally until a device exception occurs. The application will then wait for CUDA-GDB to attach itself to it for further debugging.

By default, on Ubuntu Linux, a debugger cannot attach to an already running process. In order to enable the attach feature of the CUDA debugger, either cuda-gdb should be launched as root, or /proc/sys/kernel/yama/ptrace_scope should be set to zero, using the following command:

sudo sh -c "echo 0 > /proc/sys/kernel/yama/ptrace_scope"

To make the change permanent, edit /etc/sysctl.d/10-ptrace.conf.

3.4.9. CUDA/OpenGL Interop Applications on Linux

Any CUDA application that uses OpenGL interoperability requires an active windows server. Such applications will fail to run under console-mode debugging on both Linux and Mac OS X. However, if the X server is running on Linux, the render GPU will not be enumerated when debugging, so the application could still fail, unless the application uses the OpenGL device enumeration to access the render GPU. If the X session is running in non-interactive mode while using the debugger, the render GPU will be enumerated correctly:

1. Launch your X session in non-interactive mode.
   a. Stop your X server.
   b. Edit /etc/X11/xorg.conf to contain the following line in the Device section corresponding
3.4.4. Multi-GPU Debugging in Console Mode

CUDA-GDB allows simultaneous debugging of applications running CUDA kernels on multiple GPUs. In console mode, CUDA-GDB can be used to pause and debug every GPU in the system. You can enable console mode as described above for the single-GPU console mode.

3.4.5. Multi-GPU Debugging with the Desktop Manager Running

This can be achieved by running the desktop GUI on one GPU and CUDA on the other GPU to avoid hanging the desktop GUI.

On Linux
The CUDA driver automatically excludes the GPU used by X11 from being visible to the application being debugged. This might alter the behavior of the application, since, if there are n GPUs in the system, then only n-1 GPUs will be visible to the application.

On Mac OS X
The CUDA driver exposes every CUDA-capable GPU in the system, including the one used by the Aqua desktop manager. To determine which GPU should be used for CUDA, run the 1_Utilities/deviceQuery CUDA sample. A truncated example output of deviceQuery is shown below:

Detected 2 CUDA Capable device(s)

Device 0: "GeForce GT 330M"
  CUDA Driver Version / Runtime Version:        6.0 / 6.0
  CUDA Capability Major/Minor version number:   1.2
  Total amount of global memory:                512 MBytes (536543232 bytes)
  ( 6) Multiprocessors x ( 8) CUDA Cores/MP:    48 CUDA Cores
  [...]

Device 1: ...
2. Set autosteps. To get more accurate information, we reason that since CUDA_EXCEPTION_10 is a memory access error, it must occur in code that accesses memory. This happens on lines 11, 12, 16, 17, and 18, so we set two autostep windows for those areas:

(cuda-gdb) autostep 11 for 2 lines
Breakpoint 1 at 0x796d18: file example.cu, line 11.
Created autostep of length 2 lines
(cuda-gdb) autostep 16 for 3 lines
Breakpoint 2 at 0x796e90: file example.cu, line 16.
Created autostep of length 3 lines

3. Finally, we run the program again with these autosteps:

(cuda-gdb) run
The program being debugged has been started already.
Start it from the beginning? (y or n) y
[Termination of CUDA Kernel 0 (example<<<(8,1,1),(64,1,1)>>>) on Device 0]
Starting program: /home/jitud/cudagdb_test/autostep_ex/example
[Thread debugging using libthread_db enabled]
[New Thread 0x7ffff5688700 (LWP 9089)]
[Context Create of context 0x617270 on Device 0]
[Launch of CUDA Kernel 1 (example<<<(8,1,1),(64,1,1)>>>) on Device 0]
[Switching focus to CUDA kernel 1, grid 1, block (0,0,0), thread (0,0,0), device 0, sm 0, warp 0, lane 0]

Program received signal CUDA_EXCEPTION_10, Device Illegal Address.
[Current focus set to CUDA kernel 1, grid 1, block (3,0,0), thread (32,0,0), device 0, sm 1, warp 3, lane 0]
Autostep precisely caught exception at example.cu:16 (0x796e90)

This time we correctly caught the exception at line 16. Even though CUDA_EXCEPTION_10 is a global error, we have now narrowed it down to a warp error, so the thread that threw the exception must have been in the same warp as block (3,0,0), thread (32,0,0).
on the debug host. For example, cuda-gdb could be configured to connect to the remote target as follows:

(cuda-gdb) set sysroot remote://
(cuda-gdb) target remote 192.168.0.2:1234

where 192.168.0.2 is the IP address or domain name of the remote target, and 1234 is the TCP port previously opened by cuda-gdbserver.

3.4.7. Multiple Debuggers

In a multi-GPU environment, several debugging sessions may take place simultaneously as long as the CUDA devices are used exclusively. For instance, one instance of CUDA-GDB can debug a first application that uses the first GPU, while another instance of CUDA-GDB debugs a second application that uses the second GPU. The exclusive use of a GPU is achieved by specifying which GPU is visible to the application by using the CUDA_VISIBLE_DEVICES environment variable:

$ CUDA_VISIBLE_DEVICES=1 cuda-gdb my_app

With software preemption enabled (set cuda software_preemption on), multiple CUDA-GDB instances can be used to debug CUDA applications context-switching on the same GPU. The --cuda-use-lockfile=0 option must be used when starting each debug session, as mentioned in --cuda-use-lockfile.
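For instance (a hypothetical setup with two terminals), two debug sessions can share one GPU once software preemption is enabled:

$ cuda-gdb --cuda-use-lockfile=0 app_one     # terminal 1
$ cuda-gdb --cuda-use-lockfile=0 app_two     # terminal 2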
The options listed above are ignored for GPUs with less than SM 3.5 compute capability.

3.4.3. Multi-GPU Debugging

Multi-GPU debugging designates the scenario where the application is running on more than one CUDA-capable device. Multi-GPU debugging is not much different than single-GPU debugging, except for a few additional CUDA-GDB commands that let you switch between the GPUs.

Any GPU hitting a breakpoint will pause all the GPUs running CUDA on that system. Once paused, you can use info cuda kernels to view all the active kernels and the GPUs they are running on. When any GPU is resumed, all the GPUs are resumed. If the CUDA_VISIBLE_DEVICES environment variable is used, only the specified devices are suspended and resumed.

All CUDA-capable GPUs may run one or more kernels. To switch to an active kernel, use cuda kernel <n>, where n is the ID of the kernel retrieved from info cuda kernels.

The same kernel can be loaded and used by different contexts and devices at the same time. When a breakpoint is set in such a kernel, by either name or file name and line number, it will be resolved arbitrarily to only one instance of that kernel. With the runtime API, the exact instance to which the breakpoint will be resolved cannot be controlled. With the driver API, the user can control the instance to which the breakpoint will be resolved by setting the breakpoint right after its module is loaded.
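As an illustration of switching between GPUs (a hypothetical session; IDs, coordinates, and kernel names are placeholders):

(cuda-gdb) info cuda kernels
  Kernel Parent Dev Grid Status   SMs Mask   GridDim  BlockDim Name    Args
*      0      -   0    1 Active 0x00ffffff (240,1,1) (128,1,1) kernelA ...
       1      -   1    1 Active 0x00ffffff (240,1,1) (128,1,1) kernelB ...
(cuda-gdb) cuda kernel 1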
(cuda-gdb) set cuda single_stepping_optimizations on
The debugger will use safe techniques to accelerate single-stepping. This is the default starting with the 6.0 release.

12.8. set cuda thread_selection

When the debugger must choose an active thread to focus on, the decision is guided by a heuristic. The set cuda thread_selection setting guides that heuristic.

(cuda-gdb) set cuda thread_selection logical
The thread with the lowest blockIdx/threadIdx coordinates is selected.

(cuda-gdb) set cuda thread_selection physical
The thread with the lowest dev/sm/warp/lane coordinates is selected.

12.9. set cuda value_extrapolation

Before accessing the value of a variable, the debugger checks whether the variable is live or not at the current PC. On CUDA devices, variables may not be live all the time and will be reported as "Optimized Out".

CUDA-GDB offers an option to opportunistically circumvent this limitation by extrapolating the value of a variable when the debugger would otherwise mark it as optimized out. The extrapolation is not guaranteed to be accurate and must be used carefully. If the register that was used to store the value of a variable has been reused since the last time the variable was seen as live, then the reported value will be wrong. Therefore, any value printed using the option will be marked as (possibly).

(cuda-gdb) set cuda value_extrapolation off
The debugger only reads the value of live variables. This setting is the default and is always safe.
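To illustrate (a hypothetical session; the variable name and value are placeholders), with extrapolation enabled a value printed outside its live range is flagged rather than reported as optimized out:

(cuda-gdb) set cuda value_extrapolation on
(cuda-gdb) print my_variable
$1 = (possibly) 5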
> On GPUs with an SM type lower than sm_20, it is not possible to step over a subroutine in the device code.
> Requesting to read or write GPU memory may be unsuccessful if the size is larger than 100MB on Tesla GPUs and larger than 32MB on Fermi GPUs.
> On GPUs with sm_20, if you are debugging code in device functions that get called by multiple kernels, then setting a breakpoint in the device function will insert the breakpoint in only one of the kernels.
> In a multi-GPU debugging environment on Mac OS X with Aqua running, you may experience some visible delay while single-stepping the application.
> Setting a breakpoint on a line within a __device__ or __global__ function before its module is loaded may result in the breakpoint being temporarily set on the first line of a function below in the source code. As soon as the module for the targeted function is loaded, the breakpoint will be reset properly. In the meantime, the breakpoint may be hit, depending on the application. In those situations, the breakpoint can be safely ignored, and the application can be resumed.
> The scheduler-locking option cannot be set to on.
> Stepping again after stepping out of a kernel results in undetermined behavior. It is recommended to use the continue command instead.
> To debug CUDA applications that use OpenGL, the X server may need to be launched in non-interactive mode. See CUDA/OpenGL Interop Applications on Linux for details.
> Pretty-printing is not supported.
> When remotely debugging 32-bit applications on a 64-bit server, gdbserver must be 32-bit.
CUDA-GDB is the NVIDIA tool for debugging CUDA applications running on actual hardware. This enables developers to debug applications without the potential variations introduced by simulation and emulation environments. CUDA-GDB runs on Linux and Mac OS X, 32-bit and 64-bit. CUDA-GDB is based on GDB 7.6 on both Linux and Mac OS X.

1.2. Supported Features

CUDA-GDB is designed to present the user with a seamless debugging environment that allows simultaneous debugging of both GPU and CPU code within the same application. Just as programming in CUDA C is an extension to C programming, debugging with CUDA-GDB is a natural extension to debugging with GDB. The existing GDB debugging features are inherently present for debugging the host code, and additional features have been provided to support debugging CUDA device code.

CUDA-GDB supports C and C++ CUDA applications. All the C++ features supported by the NVCC compiler can be debugged by CUDA-GDB.

CUDA-GDB allows the user to set breakpoints, to single-step CUDA applications, and also to inspect and modify the memory and variables of any given thread running on the hardware.

CUDA-GDB supports debugging all CUDA applications, whether they use the CUDA driver API, the CUDA runtime API, or both.

CUDA-GDB supports debugging kernels that have been compiled for specific CUDA architectures, such as sm_10 or sm_20, but also supports debugging kernels compiled at runtime.
> breakpoint all and breakpoint n

where n, x, y, z are integers, or one of the following special keywords: current, any, and all. current indicates that the corresponding value in the current focus should be used. any and all indicate that any value is acceptable. The breakpoint all and breakpoint n filters are only effective for the info cuda threads command.

8.4.1. info cuda devices

This command enumerates all the GPUs in the system, sorted by device index. A * indicates the device currently in focus. This command supports filters; the default is device all. This command prints No CUDA Devices if no GPUs are found.

(cuda-gdb) info cuda devices
  Dev Description SM Type SMs Warps/SM Lanes/Warp Max Regs/Lane Active SMs Mask
*   0       gt200   sm_13  24       32         32           128      0x00ffffff

8.4.2. info cuda sms

This command shows all the SMs for the device and the associated active warps on the SMs. This command supports filters, and the default is device current, sm all. A * indicates the SM in focus. The results are grouped per device.

(cuda-gdb) info cuda sms
  SM Active Warps Mask
Device 0
*  0 0xffffffffffffffff
   1 0xffffffffffffffff
   2 0xffffffffffffffff
   3 0xffffffffffffffff
   4 0xffffffffffffffff
   5 0xffffffffffffffff
   ...
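For example (a hypothetical session), filters restrict these commands to a subset of the hardware or of the grid:

(cuda-gdb) info cuda sms device 0
(cuda-gdb) info cuda threads block (1,0,0)
(cuda-gdb) info cuda warps device 0 sm current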
Table 1. CUDA Exception Codes

CUDA_EXCEPTION_0 : Device Unknown Exception
  Precision: Not precise
  Scope: Global error on the GPU
  Description: This is a global GPU error caused by the application which does not match any of the listed error codes below. This should be a rare occurrence. Potentially, this may be due to Device Hardware Stack overflows or a kernel generating an exception very close to its termination.

CUDA_EXCEPTION_1 : Lane Illegal Address
  Precision: Precise (requires memcheck on)
  Scope: Per-lane/thread error
  Description: This occurs when a thread accesses an illegal (out of bounds) global address.

CUDA_EXCEPTION_2 : Lane User Stack Overflow
  Precision: Precise
  Scope: Per-lane/thread error
  Description: This occurs when a thread exceeds its stack memory limit.

CUDA_EXCEPTION_3 : Device Hardware Stack Overflow
  Precision: Not precise
  Scope: Global error on the GPU
  Description: This occurs when the application triggers a global hardware stack overflow. The main cause of this error is large amounts of divergence in the presence of function calls.

CUDA_EXCEPTION_4 : Warp Illegal Instruction
  Precision: Not precise
  Scope: Warp error
  Description: This occurs when any thread within a warp has executed an illegal instruction.

CUDA_EXCEPTION_5 : Warp Out-of-range Address
  Precision: Not precise
  Scope: Warp error
  Description: This occurs when any thread within a warp accesses an address that is outside the valid range of local or shared memory regions.
The effort also made local debugging faster.

Kernel Entry Breakpoints
The set cuda break_on_launch option will now break on kernels launched from the GPU. Also, enabling this option does not affect kernel launch notifications.

Live Range Optimizations
To mitigate the issue of variables not being accessible at some code addresses, the debugger offers two new options. With set cuda value_extrapolation, the latest known value is displayed with the (possibly) prefix. With set cuda ptx_cache, the latest known value of the PTX register associated with a source variable is displayed with the (cached) prefix.

Event Notifications
Kernel event notifications are not displayed by default any more. New kernel events verbosity options have been added: set cuda kernel_events and set cuda kernel_events_depth. Also, set cuda defer_kernel_launch_notifications has been deprecated and has no effect any more.

5.5 Release

Kernel Launch Trace
Two new commands, info cuda launch trace and info cuda launch children, are introduced to display the kernel launch trace and the children kernels of a given kernel, when Dynamic Parallelism is used.

Single-GPU Debugging (BETA)
CUDA-GDB can now be used to debug a CUDA application on the same GPU that is rendering the desktop GUI. This feature also enables debugging of long-running or indefinite CUDA kernels that would otherwise encounter a launch timeout.
(cuda-gdb) set cuda defer_kernel_launch_notifications on
CUDA-GDB defers receiving information about kernel launches.

The set cuda defer_kernel_launch_notifications option is deprecated and has no effect any more.

Chapter 10. Automatic Error Checking

10.1. Checking API Errors

CUDA-GDB can automatically check the return code of any driver API or runtime API call. If the return code indicates an error, the debugger will stop or warn the user. The behavior is controlled with the set cuda api_failures option. Three modes are supported:

> hide: will not report any error of any kind
> ignore: will emit a warning but continue the execution of the application (default)
> stop: will emit an error and stop the application

The success return code and other non-error return codes are ignored. For the driver API, those are CUDA_SUCCESS and CUDA_ERROR_NOT_READY. For the runtime API, they are cudaSuccess and cudaErrorNotReady.

10.2. GPU Error Reporting

With improved GPU error reporting in CUDA-GDB, application bugs are now easier to identify and easy to fix. The following table shows the new errors that are reported on GPUs with compute capability sm_20 and higher. Continuing the execution of your application after these errors are found can lead to application termination or indeterminate results.
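Returning to set cuda api_failures above: as an illustration, the following hypothetical program makes a runtime API call that fails, which the stop mode would report and halt on instead of letting it pass silently:

#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    int *ptr = NULL;
    // Deliberately absurd size: cudaMalloc returns cudaErrorMemoryAllocation.
    // Under "set cuda api_failures stop" the debugger stops at this call;
    // under the default "ignore" it only emits a warning.
    cudaError_t err = cudaMalloc((void **)&ptr, (size_t)1 << 60);
    printf("cudaMalloc: %s\n", cudaGetErrorString(err));
    return 0;
}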
Reading symbols for shared libraries .. done

Breakpoint 1, main () at bitreverse.cu:25
25          void *d = NULL; int i;

5. At this point, commands can be entered to advance execution or to print the program state. For this walkthrough, let's continue until the device kernel is launched:

(cuda-gdb) continue
Continuing.
Reading symbols for shared libraries .. done
Reading symbols for shared libraries .. done
[Context Create of context 0x80f200 on Device 0]
[Launch of CUDA Kernel 0 (bitreverse<<<(1,1,1),(256,1,1)>>>) on Device 0]

Breakpoint 3 at 0x8667b8: file bitreverse.cu, line 21.
[Switching focus to CUDA kernel 0, grid 1, block (0,0,0), thread (0,0,0), device 0, sm 0, warp 0, lane 0]

Breakpoint 2, bitreverse<<<(1,1,1),(256,1,1)>>> (data=0x110000) at bitreverse.cu:9
9           unsigned int *idata = (unsigned int *)data;

CUDA-GDB has detected that a CUDA device kernel has been reached. The debugger prints the current CUDA thread of focus.

6. Verify the CUDA thread of focus with the info cuda threads command, and switch between the host thread and the CUDA threads:

(cuda-gdb) info cuda threads
  BlockIdx ThreadIdx To BlockIdx ThreadIdx Count         Virtual PC      Filename Line
Kernel 0
*  (0,0,0)   (0,0,0)      (0,0,0) (255,0,0)   256 0x0000000000866400 bitreverse.cu    9

(cuda-gdb) thread
[Current thread is 1 (process 16738)]
one breakpoint is created for each instance of the templatized code.

7.3. Address Breakpoints

To set a breakpoint at a specific address, use the break command with the address as argument:

(cuda-gdb) break *0x1afe34d0

The address can be any address on the device or the host.

7.4. Kernel Entry Breakpoints

To break on the first instruction of every launched kernel, set the break_on_launch option to application:

(cuda-gdb) set cuda break_on_launch application

See set cuda break_on_launch for more information.

7.5. Conditional Breakpoints

To make the breakpoint conditional, use the optional if keyword or the cond command:

(cuda-gdb) break foo.cu:23 if threadIdx.x == 1 && i < 5
(cuda-gdb) cond 3 threadIdx.x == 1 && i < 5

Conditional expressions may refer to any variable, including built-in variables such as threadIdx and blockIdx. Function calls are not allowed in conditional expressions.

Note that conditional breakpoints are always hit and evaluated, but the debugger reports the breakpoint as being hit only if the conditional statement is evaluated to TRUE. The process of hitting the breakpoint and evaluating the corresponding conditional statement is time-consuming. Therefore, running applications while using conditional breakpoints may slow down the debugging session. Moreover, if the conditional statement is always evaluated to FALSE, the debugger may appear to be hanging or stuck, although it is not the case. You can interrupt the application with CTRL+C to verify that progress is being made.
with the CUDA_VISIBLE_DEVICES environment variable.

New Autostep Command
A new autostep command was added. The command increases the precision of CUDA exceptions by automatically single-stepping through portions of code.

Under normal execution, the thread and instruction where an exception occurred may be imprecisely reported. However, the exact instruction that generates the exception can be determined if the program is being single-stepped when the exception occurs.

Manually single-stepping through a program is a slow and tedious process. Therefore, autostep aids the user by allowing them to specify sections of code where they suspect an exception could occur. These sections are automatically single-stepped through when the program is running, and any exception that occurs within these sections is precisely reported.

Type help autostep from CUDA-GDB for the syntax and usage of the command.

Multiple Context Support
On GPUs with a compute capability of SM20 or higher, debugging multiple contexts on the same GPU is now supported. It was a known limitation until now.

Device Assertions Support
The R285 driver released with the 4.1 version of the toolkit supports device assertions. CUDA-GDB supports the assertion call and stops the execution of the application when the assertion is hit. Then the variables and memory can be inspected as usual. The application can also be resumed past the assertion.
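For reference, device assertions are ordinary assert() calls in device code (a minimal hypothetical sketch; requires sm_20 or higher and compilation without -DNDEBUG):

#include <assert.h>

// Hypothetical kernel: if the assertion fails, CUDA-GDB stops the
// application here and the variables can be inspected as usual.
__global__ void check(const int *p)
{
    assert(p != NULL);
}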
Moreover, all the CUDA commands can be auto-completed by pressing the TAB key, as with any other GDB command.

4.3. Initialization File

The initialization file for CUDA-GDB is named .cuda-gdbinit and follows the same rules as the standard .gdbinit file used by GDB. The initialization file may contain any CUDA-GDB command. Those commands will be processed in order when CUDA-GDB is launched.

4.4. GUI Integration

Emacs
CUDA-GDB works with GUD in Emacs and XEmacs. No extra step is required other than pointing to the right binary.

To use CUDA-GDB, the gud-gdb-command-name variable must be set to cuda-gdb --annotate=3. Use M-x customize-variable to set the variable. Ensure that cuda-gdb is present in the Emacs/XEmacs $PATH.

DDD
CUDA-GDB works with DDD. To use DDD with CUDA-GDB, launch DDD with the following command:

ddd --debugger cuda-gdb

cuda-gdb must be in your $PATH.

Chapter 5. Kernel Focus

A CUDA application may be running several host threads and many device threads. To simplify the visualization of information about the state of the application, commands are applied to the entity in focus. When the focus is set to a host thread, the commands will apply only to that host thread (unless the application is fully resumed, for instance). On the device side, the focus is always set to the lowest granularity level, the device thread.

5.1. Software Coordinates vs. Hardware Coordinates
Conditional breakpoints can be set on code from CUDA modules that are not already loaded. The verification of the condition will then only take place when the ELF image of that module is loaded. Therefore, any error in the conditional expression will be deferred from the instantiation of the conditional breakpoint to the moment the CUDA module is loaded. If unsure, first set an unconditional breakpoint at the desired location, and add the conditional statement the first time the breakpoint is hit, by using the cond command.

7.6. Watchpoints

Watchpoints on CUDA code are not supported. Watchpoints on host code are supported. The user is invited to read the GDB documentation for a tutorial on how to set watchpoints on host code.

Chapter 8. Inspecting Program State

8.1. Memory and Variables

The GDB print command has been extended to decipher the location of any program variable, and can be used to display the contents of any CUDA program variable, including:

> data allocated via cudaMalloc()
> data that resides in various GPU memory regions, such as shared, local, and global memory
> special CUDA runtime variables, such as threadIdx

8.2. Variable Storage and Accessibility

Depending on the variable type and usage, variables can be stored either in registers or in local, shared, const, or global memory.
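As an illustration of the extended print command described above (hypothetical variable names):

(cuda-gdb) print threadIdx
(cuda-gdb) print my_shared_array[3]
(cuda-gdb) print *dev_buf@8

Here, my_shared_array is assumed to reside in shared memory, and dev_buf is assumed to be a device pointer returned by cudaMalloc(); the standard GDB @ operator prints the first 8 elements it points to.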
to your display:

Option "Interactive" "off"

   c. Restart your X server.

2. Log in remotely (SSH, etc.) and launch your application under CUDA-GDB. This setup works properly for single-GPU and multi-GPU configurations.

3. Ensure your DISPLAY environment variable is set appropriately. For example:

export DISPLAY=:0.0

While X is in non-interactive mode, interacting with the X session can cause your debugging session to stall or terminate.

Chapter 4. CUDA-GDB Extensions

4.1. Command Naming Convention

The existing GDB commands are unchanged. Every new CUDA command or option is prefixed with the CUDA keyword. As much as possible, CUDA-GDB command names are similar to the equivalent GDB commands used for debugging host code. For instance, the GDB commands to display the host threads and to switch to host thread 1 are, respectively:

(cuda-gdb) info threads
(cuda-gdb) thread 1

To display the CUDA threads and switch to CUDA thread 1, the user only has to type:

(cuda-gdb) info cuda threads
(cuda-gdb) cuda thread 1

4.2. Getting Help

As with GDB commands, the built-in help for the CUDA commands is accessible from the cuda-gdb command line by using the help command:

(cuda-gdb) help cuda name_of_the_cuda_command
(cuda-gdb) help set cuda name_of_the_cuda_option
(cuda-gdb) help info cuda name_of_the_info_cuda_command
5.3. Switching Focus

To switch the current focus, use the cuda command followed by the coordinates to be changed:

(cuda-gdb) cuda device 0 sm 1 warp 2 lane 3
[Switching focus to CUDA kernel 1, grid 2, block (8,0,0), thread (67,0,0), device 0, sm 1, warp 2, lane 3]
374         int totalThreads = gridDim.x * blockDim.x;

If the specified focus is not fully defined by the command, the debugger will assume that the omitted coordinates are set to the coordinates in the current focus, including the subcoordinates of the block and thread.

(cuda-gdb) cuda thread (15)
[Switching focus to CUDA kernel 1, grid 2, block (8,0,0), thread (15,0,0), device 0, sm 1, warp 0, lane 15]
374         int totalThreads = gridDim.x * blockDim.x;

The parentheses for the block and thread arguments are optional.

(cuda-gdb) cuda block 1 thread 3
[Switching focus to CUDA kernel 1, grid 2, block (1,0,0), thread (3,0,0), device 0, sm 3, warp 0, lane 3]
374         int totalThreads = gridDim.x * blockDim.x;

Chapter 6. Program Execution

Applications are launched the same way in CUDA-GDB as they are with GDB, by using the run command. This chapter describes how to interrupt and single-step CUDA applications.

6.1. Interrupting the Application

If the CUDA application appears to be hanging or stuck in an infinite loop, it is possible to manually interrupt the application by pressing CTRL+C. When the signal is received, the GPUs are suspended and the cuda-gdb prompt will appear.
If an autostep is encountered while another autostep is being executed, then the second autostep is ignored.

If an autostep is set before the location of a memory error and no memory error is hit, then it is possible that the chosen window is too small. This may be caused by the presence of function calls between the address of the autostep location and the instruction that triggers the memory error. In that situation, either increase the size of the window to make sure that the faulty instruction is included, or move the autostep location to an instruction that will be executed closer in time to the faulty instruction.

Autostep requires Fermi GPUs or above.

Related Commands

Autosteps and breakpoints share the same numbering, so most commands that work with breakpoints will also work with autosteps.

info autosteps shows all breakpoints and autosteps. It is similar to info breakpoints.

(cuda-gdb) info autosteps
Num Type     Disp Enb Address            What
1   autostep keep y   0x0000000000401234 in merge at sort.cu:30 for 49 instructions
3   autostep keep y   0x0000000000489913 in bubble at sort.cu:94 for 11 lines

disable autosteps n disables an autostep. It is equivalent to disable breakpoints n.

delete autosteps n deletes an autostep. It is equivalent to delete breakpoints n.

ignore n i tells the debugger to not single-step the next i times the debugger enters the window for autostep n. This command already exists for breakpoints.
<key>ProgramArguments</key>
<array>
    <string>/usr/libexec/taskgated</string>
    <string>-p</string>
    <string>-s</string>
</array>

After editing the file, the system must be rebooted, or the daemon stopped and relaunched, for the change to take effect.

Using the taskgated daemon this way is a possible security risk, as every application in the procmod group will have higher privileges once the -p option is added.

Debugging in console mode

While debugging the application in console mode, it is not uncommon to encounter kernel warnings about unnesting DYLD shared regions, for a debugger or a debugged process, that look as follows:

cuda-binary-gdb (map: 0xffffff8038644658) triggered DYLD shared region unnest for map: 0xffffff8038644bc8, region 0x7fff95e00000->0x7fff96000000.

While not abnormal for debuggers, this increases the system memory footprint until the target exits. To prevent such messages from appearing, make sure that the vm.shared_region_unnest_logging kernel parameter is set to zero, for example, by using the following command:

sudo sysctl -w vm.shared_region_unnest_logging=0
The verbosity is controlled with the kernel_events and kernel_events_depth options:

(cuda-gdb) set cuda kernel_events none
Possible options are:
> none: no kernel, application, or system (default)
> application: kernel launched by the user application
> system: any kernel launched by the driver, such as memset
> all: any kernel, application, and system

(cuda-gdb) set cuda kernel_events_depth 0
Controls the maximum depth of the kernels after which no kernel event notifications will be displayed. A value of zero means that there is no maximum and that all the kernel notifications are displayed. A value of one means that the debugger will display kernel event notifications only for kernels launched from the CPU (default).

In addition to displaying kernel events, the underlying policy used to notify the debugger about kernel launches can be changed. By default, kernel launches cause events that CUDA-GDB will process. If the application launches a large number of kernels, it is preferable to defer sending kernel launch notifications until the time the debugger stops the application. At this time, only the kernel launch notifications for kernels that are valid on the stopped devices will be displayed. In this mode, the debugging session will run a lot faster. The deferral of such notifications can be controlled with the defer_kernel_launch_notifications option:

(cuda-gdb) set cuda defer_kernel_launch_notifications off
CUDA-GDB receives events on kernel launches (default).
to the temporary directory used by CUDA-GDB. Otherwise, the debugger will fail with an internal error.

3.3. Compiling the Application

3.3.1. Debug Compilation

NVCC, the NVIDIA CUDA compiler driver, provides a mechanism for generating the debugging information necessary for CUDA-GDB to work properly. The -g -G option pair must be passed to NVCC when an application is compiled in order to debug with CUDA-GDB; for example:

nvcc -g -G foo.cu -o foo

Using this line to compile the CUDA application foo.cu:

> forces -O0 compilation, with the exception of very limited dead-code eliminations and register-spilling optimizations
> makes the compiler include debug information in the executable

3.3.2. Compiling for Specific GPU Architectures

By default, the compiler will only generate PTX code for the compute_10 virtual architecture. Then, at runtime, the kernels are recompiled for the GPU architecture of the target GPU(s). Compiling for a specific virtual architecture guarantees that the application will work for any GPU architecture after that, for a trade-off in performance. This is done for forward compatibility.

It is highly recommended to compile the application once and for all for the GPU architectures targeted by the application, and to generate the PTX code for the latest virtual architecture for forward compatibility. A GPU architecture is defined by its compute capability. For the list of GPUs and their respective compute capabilities, see https://developer.nvidia.com/cuda-gpus.
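For instance (an illustrative command line, not prescribed by this manual), an application targeting Fermi and Kepler while keeping forward-compatible PTX could be compiled as:

nvcc -g -G foo.cu -o foo \
     -gencode arch=compute_20,code=sm_20 \
     -gencode arch=compute_35,code=sm_35 \
     -gencode arch=compute_35,code=compute_35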