
CUDA Debugger


Contents

1. …== 1 && i < 5

www.nvidia.com | CUDA Debugger DU-05227-042_v5.5 | 22

Breakpoints & Watchpoints

Conditional expressions may refer to any variable, including built-in variables such as threadIdx and blockIdx. Function calls are not allowed in conditional expressions.

Note that conditional breakpoints are always hit and evaluated, but the debugger reports the breakpoint as being hit only if the conditional statement evaluates to TRUE. The process of hitting the breakpoint and evaluating the corresponding conditional statement is time-consuming. Therefore, running applications while using conditional breakpoints may slow down the debugging session. Moreover, if the conditional statement always evaluates to FALSE, the debugger may appear to be hanging or stuck, although that is not the case. You can interrupt the application with CTRL-C to verify that progress is being made.

Conditional breakpoints can be set on code from CUDA modules that are not already loaded. The verification of the condition will then only take place when the ELF image of that module is loaded. Therefore, any error in the conditional expression will be deferred from the instantiation of the conditional breakpoint to the moment the CUDA module is loaded. If unsure, first set an unconditional breakpoint at the desired location, and add the conditional statement the first time the breakpoint is hit by using the cond command.

7.6. Watchpoints

Watchpoints on CUDA code
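The recommended workflow above (unconditional breakpoint first, condition attached on the first hit) might look like this in a session; the file name, line number, and loop variable are illustrative, not taken from the manual:

```
(cuda-gdb) break kernel.cu:23
(cuda-gdb) run
...
Breakpoint 1, my_kernel<<<...>>> () at kernel.cu:23
(cuda-gdb) cond 1 threadIdx.x == 1 && i < 5
(cuda-gdb) continue
```

Because the condition is attached only after the module's ELF image has been loaded, any error in the expression is reported immediately instead of being deferred to module load time.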
2. …(64,1,1)>>> on Device 0

Starting program: /home/jitud/cudagdb_test/autostep_ex/example
[Thread debugging using libthread_db enabled]
[New Thread 0x7ffff5688700 (LWP 9089)]
[Context Create of context 0x617270 on Device 0]
[Launch of CUDA Kernel 1 (example<<<(8,1,1),(64,1,1)>>>) on Device 0]
[Switching focus to CUDA kernel 1, grid 1, block (0,0,0), thread (0,0,0), device 0, sm 0, warp 0, lane 0]

Program received signal CUDA_EXCEPTION_10, Device Illegal Address.
[Current focus set to CUDA kernel 1, grid 1, block (3,0,0), thread (32,0,0), device 0, sm 1, warp 3, lane 0]
Autostep precisely caught exception at example.cu:16 (0x796e90)

This time we correctly caught the exception at line 16. Even though CUDA_EXCEPTION_10 is a global error, we have now narrowed it down to a warp error, so we now know that the thread that threw the exception must have been in the same warp as block 3, thread 32.

In this example, we have narrowed down the scope of the error from 512 threads down to 32 threads just by setting two autosteps and re-running the program.

12.3. Example 3: Debugging an MPI CUDA Application

For doing large MPI CUDA application debugging, NVIDIA recommends using parallel debuggers supplied by our partners Allinea and TotalView. Both make excellent parallel debuggers with extended support for CUDA. However, for debugging smaller applications, or for debugging just a f…
3. CUDA_VISIBLE_DEVICES=1 cuda-gdb my_app

With software preemption enabled (set cuda software_preemption on), multiple CUDA-GDB instances can be used to debug CUDA applications context-switching on the same GPU. The --cuda-use-lockfile=0 option must be used when starting each debug session, as mentioned in Lock File.

cuda-gdb --cuda-use-lockfile=0 my_app

3.4.8. Attaching/Detaching

CUDA-GDB can attach to and detach from a CUDA application running on GPUs with compute capability 2.0 and beyond, using GDB's built-in commands for attaching to or detaching from a process.

Additionally, if the environment variable CUDA_DEVICE_WAITS_ON_EXCEPTION is set to 1 prior to running the CUDA application, the application will run normally until a device exception occurs. The application will then wait for CUDA-GDB to attach itself to it for further debugging.

3.4.9. CUDA/OpenGL Interop Applications on Linux

Any CUDA application that uses OpenGL interoperability requires an active windows server. Such applications will fail to run under console mode debugging on both Linux and Mac OS X. However, if the X server is running on Linux, the render GPU will not be enumerated when debugging, so the application could still fail unless the application uses the OpenGL device enumeration to access the render GPU. But if the X session is running in non-interactive mode while using the debugge…
4. (cuda-gdb) help cuda name_of_the_cuda_command
(cuda-gdb) help set cuda name_of_the_cuda_option
(cuda-gdb) help info cuda name_of_the_info_cuda_command

Moreover, all the CUDA commands can be auto-completed by pressing the TAB key, as with any other GDB command.

4.3. Initialization File

The initialization file for CUDA-GDB is named .cuda-gdbinit and follows the same rules as the standard .gdbinit file used by GDB. The initialization file may contain any CUDA-GDB command. Those commands will be processed in order when CUDA-GDB is launched.

4.4. GUI Integration

Emacs

CUDA-GDB works with GUD in Emacs and XEmacs. No extra step is required other than pointing to the right binary. To use CUDA-GDB, the gud-gdb-command-name variable must be set to "cuda-gdb --annotate=3". Use M-x customize-variable to set the variable. Ensure that cuda-gdb is present in the Emacs/XEmacs $PATH.

DDD

CUDA-GDB works with DDD. To use DDD with CUDA-GDB, launch DDD with the following command:

ddd --debugger cuda-gdb

cuda-gdb must be in your $PATH.

Chapter 5. KERNEL FOCUS

A CUDA application may be running several host threads and many device threads. To simplify the visualization of information about the state of the application, commands are applied to the entity in focus. When the focus is set to a host thread, the com…
5. …software preemption (set cuda software_preemption on) to debug multiple CUDA applications context-switching on the same GPU.

Appendix A. SUPPORTED PLATFORMS

The general platform and GPU requirements for running NVIDIA CUDA-GDB are described in this section.

A.1. Host Platform Requirements

Mac OS

CUDA-GDB is supported on both 32-bit and 64-bit editions of the following Mac OS versions:
> Mac OS X 10.7
> Mac OS X 10.8

Linux

CUDA-GDB is supported on both 32-bit and 64-bit editions of the following Linux distributions:
> Red Hat Enterprise Linux 5.5 (64-bit only)
> Red Hat Enterprise Linux 6.x
> Ubuntu 12.04 and 12.10
> Fedora 18
> OpenSUSE 12.2
> SUSE Linux Enterprise Server 11.1 and 11 SP2

GPU Requirements

Debugging is supported on all CUDA-capable GPUs with a compute capability of 1.1 or later. Compute capability is a device attribute that a CUDA application can query; for more information, see the latest NVIDIA CUDA Programming Guide on the NVIDIA CUDA Zone web site: http://developer.nvidia.com/object/gpucomputing.html

These GPUs have a compute capability of 1.0 and are not supported:
> GeForce 8800 GTS
> GeForce 8800 GTX
> GeForce 8800 Ultra
> Quadro Plex 1000 Model IV
> Quadro Plex 2100 Model S4
> Quadro FX 4600
> Quadro FX 5600
6. …(0,0,0) (3,0,0) 0x0000000000948e58 0 11 0 3 infoCommands.cu 12
(0,0,0) (4,0,0) 0x0000000000948e58 0 11 0 4 infoCommands.cu 12
(0,0,0) (5,0,0) 0x0000000000948e58 0 11 0 5 infoCommands.cu 12

(cuda-gdb) info cuda threads breakpoint 2 lane 1
BlockIdx ThreadIdx Virtual PC Dev SM Wp Ln Filename Line
Kernel 0
(0,0,0) (1,0,0) 0x0000000000948e58 0 11 0 1 infoCommands.cu 12

8.4.8. info cuda launch trace

This command displays the kernel launch trace for the kernel in focus. The first element in the trace is the kernel in focus. The next element is the kernel that launched this kernel. The trace continues until there is no parent kernel, in which case the kernel is CPU-launched.

For each kernel in the trace, the command prints the level of the kernel in the trace, the kernel ID, the device ID, the grid ID, the status, the kernel dimensions, the kernel name, and the kernel arguments.

(cuda-gdb) info cuda launch trace
Lvl Kernel Dev Grid Status GridDim BlockDim Invocation
*0 3 0 -7 Active (16,1,1) (16,1,1) kernel3(c=5)
 1 2 0 -5 Terminated (240,1,1) (128,1,1) kernel2(b=3)
 2 1 0 2 Active (240,1,1) (128,1,1) kernel1(a=1)

A kernel that has been launched but that is not running on the GPU will have a Pending status. A kernel currently running on the GPU will be marked as Active. A kernel waiting to become active again will be displayed as Sleeping. When a kernel has terminated, it is mar…
7. …64

The example below shows how to access the starting address of the input parameter to the kernel.

(cuda-gdb) print &data
$6 = (const @global void * const @parameter *) 0x10
(cuda-gdb) print *(@global void * const @parameter *) 0x10
$7 = (@global void * const @parameter) 0x110000

8.3. Inspecting Textures

The debugger can always read/write the source variables when the PC is on the first assembly instruction of a source instruction. When doing assembly-level debugging, the value of source variables is not always accessible.

To inspect a texture, use the print command while de-referencing the texture recast to the type of the array it is bound to. For instance, if texture tex is bound to array A of type float*, use:

(cuda-gdb) print *(@texture float *)tex

All the array operators, such as [], can be applied to (@texture float *)tex:

(cuda-gdb) print ((@texture float *)tex)[2]
(cuda-gdb) print ((@texture float *)tex)[2]@4

8.4. Info CUDA Commands

These are commands that display information about the GPU and the application's CUDA state. The available options are:

> devices: information about all the devices
> sms: information about all the SMs in the current device
> warps: information about all the warps in the current SM
> lanes: information about all the lanes in the current warp
> kernels: information about all the active kernels
> b…
8. …GPU Debugging with the Desktop Manager Running ... 10
3.4.3. Multi-GPU Debugging ... 10
3.4.4. Multi-GPU Debugging in Console Mode ... 11
3.4.5. Multi-GPU Debugging with the Desktop Manager Running ... 11
3.4.6. Remote Debugging ... 12
3.4.7. Multiple Debuggers ... 13
3.4.8. Attaching/Detaching ... 13
3.4.9. CUDA/OpenGL Interop Applications on Linux ... 13
Chapter 4. CUDA-GDB Extensions ... 15
4.1. Command Naming Convention ... 15
4.2. Getting Help ... 15
4.3. Initialization File ... 15
4.4. GUI Integration ... 16
Chapter 5. Kernel Focus ... 17
5.1. Software Coordinates vs Hardware Coordinates ...
9. Using the Debugger

Debugging a CUDA GPU involves pausing that GPU. When the graphics desktop manager is running on the same GPU, debugging that GPU freezes the GUI and makes the desktop unusable. To avoid this, use CUDA-GDB in the following system configurations.

3.4.1. Single-GPU Debugging

In a single-GPU system, CUDA-GDB can be used to debug CUDA applications only if no X11 server (on Linux) or no Aqua desktop manager (on Mac OS X) is running on that system.

On Linux

On Linux you can stop the X11 server by stopping the gdm service.

On Mac OS X

On Mac OS X you can log in with ">console" as the user name in the desktop UI login screen. To enable the console login option, open System Preferences > Users & Groups > Login Options tab, set the automatic login option to Off, and set "Display login window as" to "Name and password".

To launch/debug CUDA applications in console mode on systems with an integrated GPU and a discrete GPU, also make sure that the Automatic Graphics Switching option in the System Preferences > Energy Saver tab is unchecked.

3.4.2. Single-GPU Debugging with the Desktop Manager Running

CUDA-GDB can be used to debug CUDA applications on the same GPU that is running the desktop GUI. This is a BETA feature, available on Linux, and supports devices with SM3.5 compute capability. There are two ways to enable this functionality:

> U…
10. …are not supported. Watchpoints on host code are supported. The user is invited to read the GDB documentation for a tutorial on how to set watchpoints on host code.

Chapter 8. INSPECTING PROGRAM STATE

8.1. Memory and Variables

The GDB print command has been extended to decipher the location of any program variable, and can be used to display the contents of any CUDA program variable, including:

> data allocated via cudaMalloc()
> data that resides in various GPU memory regions, such as shared, local, and global memory
> special CUDA runtime variables, such as threadIdx

8.2. Variable Storage and Accessibility

Depending on the variable type and usage, variables can be stored either in registers or in local, shared, const, or global memory. You can print the address of any variable to find out where it is stored and directly access the associated memory.

The example below shows how the variable array, which is of type @shared int*, can be directly accessed in order to see what the stored values are in the array.

(cuda-gdb) print &array
$1 = (@shared int (*)[0]) 0x20
(cuda-gdb) print array[0]@4
$2 = {0, 128, 64, 192}

You can also access the shared memory indexed into the starting offset to see what the stored values are:

(cuda-gdb) print *(@shared int*)0x20
$3 = 0
(cuda-gdb) print *(@shared int*)0x24
$4 = 128
(cuda-gdb) print *(@shared int*)0x28
$5 = 64
11. …cu:9

7. Corroborate this information by printing the block and thread indexes:

(cuda-gdb) print blockIdx
$1 = {x = 0, y = 0}
(cuda-gdb) print threadIdx
$2 = {x = 0, y = 0, z = 0}

8. The grid and block dimensions can also be printed:

(cuda-gdb) print gridDim
$3 = {x = 1, y = 1}
(cuda-gdb) print blockDim
$4 = {x = 256, y = 1, z = 1}

9. Advance kernel execution and verify some data:

(cuda-gdb) next
12  array[threadIdx.x] = idata[threadIdx.x];
(cuda-gdb) next
14  array[threadIdx.x] = ((0xf0f0f0f0 & array[threadIdx.x]) >> 4) | ((0x0f0f0f0f & array[threadIdx.x]) << 4);
(cuda-gdb) next
16  array[threadIdx.x] = ((0xcccccccc & array[threadIdx.x]) >> 2) | ((0x33333333 & array[threadIdx.x]) << 2);
(cuda-gdb) next
18  array[threadIdx.x] = ((0xaaaaaaaa & array[threadIdx.x]) >> 1) | ((0x55555555 & array[threadIdx.x]) << 1);
(cuda-gdb) next

Breakpoint 3, bitreverse<<<(1,1,1),(256,1,1)>>> (data=0x100000) at bitreverse.cu:21
21  idata[threadIdx.x] = array[threadIdx.x];

(cuda-gdb) print array[0]@12
$7 = {0, 128, 64, 192, 32, 160, 96, 224, 16, 144, 80, 208}
(cuda-gdb) print/x array[0]@12
$8 = {0x0, 0x80, 0x40, 0xc0, 0x20, 0xa0, 0x60, 0xe0, 0x10, 0x90, 0x50, 0xd0}
(cuda-gdb) print &data
$9 = (@global void * @parameter *) 0x10
(cuda-gdb) print *(@global void * @parameter *) 0x10
$10 = (@global void * @parameter) 0x100000

The resulting output depends on the current content of the memory location.

10. Since thread (0,0,0) reverses the value…
12. …cuda coalescing off

The following is the output of the same command when coalescing is turned off:

(cuda-gdb) info cuda blocks
BlockIdx State Dev SM
Kernel 1
* (0,0,0) running 0 0
  (1,0,0) running 0 3
  (2,0,0) running 0 6
  (3,0,0) running 0 9
  (4,0,0) running 0 12
  (5,0,0) running 0 15
  (6,0,0) running 0 18
  (7,0,0) running 0 21
  (8,0,0) running 0 …

8.4.7. info cuda threads

This command displays the application's active CUDA blocks and threads, with the total count of threads in those blocks. Also displayed are the virtual PC and the associated source file and line number information. The results are grouped per kernel. The command supports filters, with the default being kernel current block all thread all. The outputs are coalesced by default, as follows:

(cuda-gdb) info cuda threads
BlockIdx ThreadIdx To BlockIdx ThreadIdx Count Virtual PC Filename Line
Device 0 SM 0
* (0,0,0) (0,0,0) (0,0,0) (31,0,0) 32 0x000000000088f88c acos.cu 376
  (0,0,0) (32,0,0) (191,0,0) (127,0,0) 24544 0x000000000088f800 acos.cu 374

Coalescing can be turned off, as follows, in which case more information is displayed with the output:

(cuda-gdb) info cuda threads
BlockIdx ThreadIdx Virtual PC Dev SM Wp Ln Filename Line
Kernel 1
* (0,0,0) (0,0,0) 0x000000000088f88c 0 0 0 0 acos.cu 376
  (0,0,0) (1,0,0) 0x000000000088f88c 0 0 0 1 acos.cu 376
  (0,0,0) (2,0,0) 0x00…
13. … 32
Chapter 10. Checking Memory Errors ... 34
10.1. Increasing the Precision of Memory Errors With Autostep ... 34
10.1.1. Usage ... 35
10.1.2. Related Commands ... 36
10.1.2.1. info autosteps ... 36
10.1.2.2. disable autosteps ... 36
10.1.2.3. delete autosteps ... 36
10.1.2.4. … ... 36
10.2. GPU Error Reporting ... 36
Chapter 11. Checking API Errors ... 39
Chapter 12. Walk-Through Examples ... 40
12.1. Example 1: bitreverse ... 40
12.1.1. Walking through the Code ... 41
12.2. Example 2: autostep ... 43
12.2.1. Debugging with Autosteps ... 44
12.3. Example 3: Debugging an MPI CUDA Application ... 45
Chapte…
14. …is assumed that the user already knows the basic GDB commands used to debug host applications.

Chapter 2. RELEASE NOTES

5.5 Release

Kernel Launch Stack
Two new commands, info cuda launch stack and info cuda launch children, are introduced to display the kernel launch stack and the children kernels of a given kernel, when Dynamic Parallelism is used.

Single-GPU Debugging (BETA)
CUDA-GDB can now be used to debug a CUDA application on the same GPU that is rendering the desktop GUI. This feature also enables debugging of long-running or indefinite CUDA kernels that would otherwise encounter a launch timeout. In addition, multiple CUDA-GDB sessions can debug CUDA applications context-switching on the same GPU. This feature is available on Linux with SM3.5 devices. For information on enabling this, please see Single-GPU Debugging with the Desktop Manager Running and Multiple Debuggers.

Remote GPU Debugging
CUDA-GDB in conjunction with CUDA-GDBSERVER can now be used to debug a CUDA application running on a remote host.

5.0 Release

Dynamic Parallelism Support
CUDA-GDB fully supports Dynamic Parallelism, a new feature introduced with the 5.0 toolkit. The debugger is able to track the kernels launched from another kernel and to inspect and modify variables like any other CPU-launched kernel.

Attach/Detach
It is now possible to attach to a CUDA application that is already running. It…
15. …needs to be notified. The notification takes place in the form of a signal being sent to a host thread. The host thread to receive that special signal is determined with the set cuda notify option.

> (cuda-gdb) set cuda notify youngest
  The host thread with the smallest thread id will receive the notification signal (default).
> (cuda-gdb) set cuda notify random
  An arbitrary host thread will receive the notification signal.

13.2. Lock File

When debugging an application, CUDA-GDB will suspend all the visible CUDA-capable devices. To avoid any resource conflict, only one CUDA-GDB session is allowed at a time. To enforce this restriction, CUDA-GDB uses a locking mechanism implemented with a lock file. That lock file prevents two CUDA-GDB processes from running simultaneously.

However, if the user desires to debug two applications simultaneously through two separate CUDA-GDB sessions, the following solutions exist:

> Use the CUDA_VISIBLE_DEVICES environment variable to target unique GPUs for each CUDA-GDB session. This is described in more detail in Multiple Debuggers.
> Lift the lockfile restriction by using the --cuda-use-lockfile command-line option:

  cuda-gdb --cuda-use-lockfile=0 my_app

This option is the recommended solution when debugging multiple ranks of an MPI application that uses separate GPUs for each rank. It is also required when using
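The two solutions above might look like this across two terminals; the application names are hypothetical:

```
# Solution 1: give each session its own GPU
CUDA_VISIBLE_DEVICES=0 cuda-gdb my_app_rank0    # terminal 1
CUDA_VISIBLE_DEVICES=1 cuda-gdb my_app_rank1    # terminal 2

# Solution 2: lift the lock-file restriction in each session
cuda-gdb --cuda-use-lockfile=0 my_app
```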
16. …of 0, switch to a different thread to show more interesting data:

(cuda-gdb) cuda thread 170
[Switching focus to CUDA kernel 0, grid 1, block (0,0,0), thread (170,0,0), device 0, sm 0, warp 5, lane 10]

11. Delete the breakpoints and continue the program to completion:

(cuda-gdb) delete breakpoints
Delete all breakpoints? (y or n) y
(cuda-gdb) continue
Continuing.
Program exited normally.
(cuda-gdb)

12.2. Example 2: autostep

This section shows how to use the autostep command and demonstrates how it helps increase the precision of memory error reporting.

Source Code

#define NUM_BLOCKS 8
#define THREADS_PER_BLOCK 64

__global__ void example(int **data) {
  int value1, value2, value3, value4, value5;
  int idx1, idx2, idx3;

  idx1 = blockIdx.x * blockDim.x;
  idx2 = threadIdx.x;
  idx3 = idx1 + idx2;

  value1 = *(data[idx1]);
  value2 = *(data[idx2]);
  value3 = value1 + value2;
  value4 = value1 * value2;
  value5 = value3 + value4;

  *(data[idx3]) = value5;    /* line 16: the faulting write */
  *(data[idx1]) = value3;
  *(data[idx2]) = value4;

  idx1 = idx2 = idx3 = 0;
}

int main(int argc, char *argv[]) {
  int *host_data[NUM_BLOCKS * THREADS_PER_BLOCK];
  int **dev_data;
  const int zero = 0;

  /* Allocate an integer for each thread in each block */
  for (int block = 0; block < NU…
17. …operating system: Mac OS X or Linux. See Host Platform Requirements.
3. Download and install the CUDA Driver.
4. Download and install the CUDA Toolkit.

3.2. Setting Up the Debugger Environment

3.2.1. Linux

Set up the PATH and LD_LIBRARY_PATH environment variables:

export PATH=/usr/local/cuda-5.5/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-5.5/lib64:/usr/local/cuda-5.5/lib:$LD_LIBRARY_PATH

3.2.2. Mac OS X

Set up environment variables:

export PATH=/Developer/NVIDIA/CUDA-5.5/bin:$PATH
export DYLD_LIBRARY_PATH=/Developer/NVIDIA/CUDA-5.5/lib:$DYLD_LIBRARY_PATH

Set permissions

The first time cuda-gdb is executed, a pop-up dialog window will appear to allow the debugger to take control of another process. The user must have Administrator privileges to allow it. It is a required step.

Another solution used in the past is to add the cuda-binary-gdb to the procmod group and set the taskgated daemon to let such processes take control of other processes. It used to be the solution to fix the "Unable to find Mach task port for processid" error.

sudo chgrp procmod /Developer/NVIDIA/CUDA-5.5/bin/cuda-binary-gdb
sudo chmod 2755 /Developer/NVIDIA/CUDA-5.5/bin/cuda-binary-gdb
sudo chmod 755 /Developer/NVIDIA/CUDA-5.5/bin/cuda-gdb

To set the taskgated daemon to allow the processes in the procmod group to access Task Ports, taskgated must be launched with th…
18. …program is running, and any exception that occurs within these sections is precisely reported. Type help autostep from CUDA-GDB for the syntax and usage of the command.

Multiple Context Support
On GPUs with compute capability of SM20 or higher, debugging multiple contexts on the same GPU is now supported. It was a known limitation until now.

Device Assertions Support
The R285 driver released with the 4.1 version of the toolkit supports device assertions. CUDA-GDB supports the assertion call and stops the execution of the application when the assertion is hit. Then the variables and memory can be inspected as usual. The application can also be resumed past the assertion if needed. Use the set cuda hide_internal_frames option to expose/hide the system call frames (hidden by default).

Temporary Directory
By default, the debugger API will use /tmp as the directory to store temporary files. To select a different directory, the $TMPDIR environment variable and the API CUDBG_APICLIENT_PID variable must be set.

Chapter 3. GETTING STARTED

Included in this chapter are instructions for installing CUDA-GDB and for using NVCC, the NVIDIA CUDA compiler driver, to compile CUDA programs for debugging.

3.1. Installation Instructions

Follow these steps to install CUDA-GDB.
1. Visit the NVIDIA CUDA Zone download page: http://www.nvidia.com/object/cuda_get.html
2. Select the appropriate…
19. …0000088f88c 0 0 0 2 acos.cu 376
  (0,0,0) (3,0,0) 0x000000000088f88c 0 0 0 3 acos.cu 376
  (0,0,0) (4,0,0) 0x000000000088f88c 0 0 0 4 acos.cu 376
  (0,0,0) (5,0,0) 0x000000000088f88c 0 0 0 5 acos.cu 376
  (0,0,0) (6,0,0) 0x000000000088f88c 0 0 0 6 acos.cu 376
  (0,0,0) (7,0,0) 0x000000000088f88c 0 0 0 7 acos.cu 376
  (0,0,0) (8,0,0) 0x000000000088f88c 0 0 0 8 acos.cu 376
  (0,0,0) (9,0,0) 0x000000000088f88c 0 0 0 9 acos.cu 376

In coalesced form, threads must be contiguous in order to be coalesced. If some threads are not currently running on the hardware, they will create holes in the thread ranges. For instance, if a kernel consists of 2 blocks of 16 threads, and only the 8 lowest threads are active, then 2 coalesced ranges will be printed: one range for block 0 thread 0 to 7, and one range for block 1 thread 0 to 7. Because threads 8-15 in block 0 are not running, the 2 ranges cannot be coalesced.

The command also supports breakpoint all and breakpoint breakpoint_number as filters. The former displays the threads that hit all CUDA breakpoints set by the user. The latter displays the threads that hit the CUDA breakpoint breakpoint_number.

(cuda-gdb) info cuda threads breakpoint all
BlockIdx ThreadIdx Virtual PC Dev SM Wp Ln Filename Line
Kernel 0
(1,0,0) (0,0,0) 0x0000000000948e58 0 11 0 0 infoCommands.cu 12
(1,0,0) (1,0,0) 0x0000000000948e58 0 11 0 1 infoCommands.cu 12
(1,0,0) (2,0,0) 0x0000000000948e58 0 11 0 2 infoCommands.cu 12
20. …
[Switching focus to CUDA kernel 0, grid 1, block (0,0,0), thread (0,0,0), device 0, sm 0, warp 0, lane 0]

Breakpoint 2, bitreverse<<<(1,1,1),(256,1,1)>>> (data=0x110000) at bitreverse.cu:9
9  unsigned int *idata = (unsigned int*)data;

CUDA-GDB has detected that a CUDA device kernel has been reached. The debugger prints the current CUDA thread of focus.

6. Verify the CUDA thread of focus with the info cuda threads command, and switch between the host thread and the CUDA threads:

(cuda-gdb) info cuda threads
BlockIdx ThreadIdx To BlockIdx ThreadIdx Count Virtual PC Filename Line
Kernel 0
* (0,0,0) (0,0,0) (0,0,0) (255,0,0) 256 0x0000000000866400 bitreverse.cu 9

(cuda-gdb) thread
[Current thread is 1 (process 16738)]
(cuda-gdb) thread 1
[Switching to thread 1 (process 16738)]
#0 0x000019d5 in main () at bitreverse.cu:34
34  bitreverse<<<1, N, N*sizeof(int)>>>(d);
(cuda-gdb) backtrace
#0 0x00001945 in main () at bitreverse.cu:34

(cuda-gdb) info cuda kernels
Kernel Dev Grid SMs Mask GridDim BlockDim Name Args
0 0 1 0x00000001 (1,1,1) (256,1,1) bitreverse data=0x110000

(cuda-gdb) cuda kernel 0
[Switching focus to CUDA kernel 0, grid 1, block (0,0,0), thread (0,0,0), device 0, sm 0, warp 0, lane 0]
9  unsigned int *idata = (unsigned int*)data;
(cuda-gdb) backtrace
#0 bitreverse<<<(1,1,1),(256,1,1)>>> (data=0x110000) at bitreverse.cu…
21. …
For the disassembly instruction to work properly, cuobjdump must be installed and present in your $PATH.

Chapter 9. EVENT NOTIFICATIONS

As the application is making forward progress, CUDA-GDB notifies the user about kernel events and context events. Within CUDA-GDB, kernel refers to the device code that executes on the GPU, while context refers to the virtual address space on the GPU for the kernel. You can turn ON or OFF the display of CUDA context and kernel events to review the flow of the active contexts and kernels.

9.1. Context Events

By default, any time a CUDA context is created, pushed, popped, or destroyed by the application, CUDA-GDB displays a notification message. The message includes the context id and the device id to which the context belongs.

[Context Create of context 0xad2fe60 on Device 0]
[Context Destroy of context 0xad2fe60 on Device 0]

The context event notification policy is controlled with the context_events option.

> (cuda-gdb) set cuda context_events on
  CUDA-GDB displays the context event notification messages (default).
> (cuda-gdb) set cuda context_events off
  CUDA-GDB does not display the context event notification messages.

9.2. Kernel Events

By default, when CUDA-GDB is made aware of the launch or the termination of a CUDA kernel launched from the host, a notification message is displayed. Th…
22. NVIDIA CUDA-GDB: CUDA DEBUGGER

TABLE OF CONTENTS

Chapter 1. Introduction ... 1
1.1. What is CUDA-GDB? ... 1
1.2. Supported Features ... 1
1.3. About This Document ... 2
Chapter 2. Release Notes ... 3
Chapter 3. Getting Started ... 6
3.1. Installation Instructions ... 6
3.2. Setting Up the Debugger Environment ... 6
3.2.1. Linux ... 6
3.2.2. Mac OS X ... 6
3.2.3. Temporary Directory ... 8
3.3. Compiling the Application ... 8
3.3.1. Debug Compilation ... 8
3.3.2. Compiling For Specific GPU architectures ... 8
3.4. Using the Debugger ... 9
3.4.1. Single-GPU Debugging ... 9
3.4.2. Single-
23. …
  for (int block = 0; block < NUM_BLOCKS; block++) {
    for (int thread = 0; thread < THREADS_PER_BLOCK; thread++) {
      int idx = block * THREADS_PER_BLOCK + thread;
      cudaMalloc(&host_data[idx], sizeof(int));
      cudaMemcpy(host_data[idx], &zero, sizeof(int), cudaMemcpyHostToDevice);
    }
  }

  /* Simulate a NULL pointer for block 3, thread 39 (line 38) */
  host_data[3 * THREADS_PER_BLOCK + 39] = NULL;

  /* Copy the array of pointers to the device */
  cudaMalloc((void**)&dev_data, sizeof(host_data));
  cudaMemcpy(dev_data, host_data, sizeof(host_data), cudaMemcpyHostToDevice);

  /* Execute example */
  example<<<NUM_BLOCKS, THREADS_PER_BLOCK>>>(dev_data);
  cudaThreadSynchronize();

In this small example, we have an array of pointers to integers, and we want to do some operations on the integers. Suppose, however, that one of the pointers is NULL, as shown in line 38. This will cause CUDA_EXCEPTION_10 "Device Illegal Address" to be thrown when we try to access the integer that corresponds with block 3, thread 39. This exception should occur at line 16, when we try to write to that value.

12.2.1. Debugging with Autosteps

1. Compile the example and start CUDA-GDB as normal. We begin by running the program:

(cuda-gdb) run
Starting program: /home/jitud/cudagdb_test/autostep_ex/example
[Thread debugging using libthread_db enabled]
[New Thread 0x7ffff5688700 (LWP 9083)]
[Context Create of context 0x617270 on Devi…
24. …S ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE.

Information furnished is believed to be accurate and reliable. However, NVIDIA Corporation assumes no responsibility for the consequences of use of such information or for any infringement of patents or other rights of third parties that may result from its use. No license is granted by implication or otherwise under any patent rights of NVIDIA Corporation. Specifications mentioned in this publication are subject to change without notice. This publication supersedes and replaces all other information previously supplied. NVIDIA Corporation products are not authorized as critical components in life support devices or systems without express written approval of NVIDIA Corporation.

Trademarks

NVIDIA and the NVIDIA logo are trademarks or registered trademarks of NVIDIA Corporation in the U.S. and other countries. Other company and product names may be trademarks of the respective companies with which they are associated.

Copyright

© 2007-2013 NVIDIA Corporation. All rights reserved.

www.nvidia.com
       0xffffffff 0x00000000 0x0000000000000000      0 (0,0,0)
     4 0xffffffff 0x00000000 0x0000000000000000      0 (0,0,0)
     5 0xffffffff 0x00000000 0x0000000000000000      0 (0,0,0)
     6 0xffffffff 0x00000000 0x0000000000000000      0 (0,0,0)
     7 0xffffffff 0x00000000 0x0000000000000000      0 (0,0,0)

8.4.4 info cuda lanes

This command displays all the lanes (threads) for the warp in focus. This command supports filters, and the default is "device current sm current warp current lane all". In the example below, you can see that all the lanes are at the same physical PC. The command can be used to display which lane executes what thread.

    (cuda-gdb) info cuda lanes
      Ln State  Physical PC        ThreadIdx
    Device 0 SM 0 Warp 0
    *  0 active 0x000000000000008c (0,0,0)
       1 active 0x000000000000008c (1,0,0)
       2 active 0x000000000000008c (2,0,0)
       3 active 0x000000000000008c (3,0,0)
       4 active 0x000000000000008c (4,0,0)
       5 active 0x000000000000008c (5,0,0)
       6 active 0x000000000000008c (6,0,0)
       7 active 0x000000000000008c (7,0,0)
       8 active 0x000000000000008c (8,0,0)
       9 active 0x000000000000008c (9,0,0)
      10 active 0x000000000000008c (10,0,0)
      11 active 0x000000000000008c (11,0,0)
      12 active 0x000000000000008c (12,0,0)
      13 active 0x000000000000008c (13,0,0)
      14 active 0x000000000000008c (14,0,0)
      15 active 0x000000000000008c (15,0,0)
      16 active 0x000000000000008c (16,0,0)

8.4.5 info cuda kernels

This command displays all the active kernels
      Warps/SM Lanes/Warp Max Regs/Lane Active SMs Mask
    *   0 gt200 sm_13 24 32 32 128 0x00ffffff

8.4.2 info cuda sms

This command shows all the SMs for the device and the associated active warps on the SMs. This command supports filters, and the default is "device current sm all". A * indicates the SM in focus. The results are grouped per device.

    (cuda-gdb) info cuda sms
      SM Active Warps Mask
    Device 0
    *  0 0xffffffffffffffff
       1 0xffffffffffffffff
       2 0xffffffffffffffff
       3 0xffffffffffffffff
       4 0xffffffffffffffff
       5 0xffffffffffffffff
       6 0xffffffffffffffff
       7 0xffffffffffffffff
       8 0xffffffffffffffff
       9 0xffffffffffffffff

8.4.3 info cuda warps

This command takes you one level deeper and prints all the warps information for the SM in focus. This command supports filters, and the default is "device current sm current warp all". The command can be used to display which warp executes what block.

    (cuda-gdb) info cuda warps
      Wp Active Lanes Mask Divergent Lanes Mask Active Physical PC Kernel BlockIdx
    Device 0 SM 0
    *  0 0xffffffff 0x00000000 0x000000000000001c      0 (0,0,0)
       1 0xffffffff 0x00000000 0x0000000000000000      0 (0,0,0)
       2 0xffffffff 0x00000000 0x0000000000000000      0 (0,0,0)
       3
can be resumed.
- The scheduler-locking option cannot be set to on.
- Stepping again after stepping out of a kernel results in undetermined behavior. It is recommended to use the continue command instead.
- To debug CUDA applications that use OpenGL, the X server may need to be launched in non-interactive mode. See CUDA/OpenGL Interop Applications on Linux for details.
- Pretty printing is not supported.
- When remotely debugging 32-bit applications on a 64-bit server, gdbserver must be 32-bit.
- Attaching to a CUDA application with Software Preemption enabled in cuda-gdb is not supported.
- Attaching to a CUDA application running in MPS client mode is not supported.
- Attaching to the MPS server process (nvidia-cuda-mps-server) using cuda-gdb, or starting the MPS server with cuda-gdb, is not supported.
- If a CUDA application is started in the MPS client mode with cuda-gdb, the MPS client will wait until all other MPS clients have terminated, and will then run as a non-MPS application.

Notice
ALL NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS, DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY, "MATERIALS") ARE BEING PROVIDED "AS IS". NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE MATERIALS, AND EXPRESSLY DISCLAIMS
    (cuda-gdb) cuda device sm warp lane block thread
    block (0,0,0), thread (0,0,0), device 0, sm 0, warp 0, lane 0
    (cuda-gdb) cuda kernel block thread
    kernel 1, block (0,0,0), thread (0,0,0)
    (cuda-gdb) cuda kernel
    kernel 1

5.3 Switching Focus

To switch the current focus, use the cuda command followed by the coordinates to be changed:

    (cuda-gdb) cuda device 0 sm 1 warp 2 lane 3
    [Switching focus to CUDA kernel 1, grid 2, block (8,0,0), thread (67,0,0), device 0, sm 1, warp 2, lane 3]
    374  int totalThreads = gridDim.x * blockDim.x;

If the specified focus is not fully defined by the command, the debugger will assume that the omitted coordinates are set to the coordinates in the current focus, including the subcoordinates of the block and thread.

    (cuda-gdb) cuda thread (15)
    [Switching focus to CUDA kernel 1, grid 2, block (8,0,0), thread (15,0,0), device 0, sm 1, warp 0, lane 15]
    374  int totalThreads = gridDim.x * blockDim.x;

The parentheses for the block and thread arguments are optional:

    (cuda-gdb) cuda block 1 thread 3
    [Switching focus to CUDA kernel 1, grid 2, block (1,0,0), thread (3,0,0), device 0, sm 3, warp 0, lane 3]
    374  int totalThreads = gridDim.x * blockDim.x;

Chapter 6. PROGRAM EXECUTION

Applications are launched the same way in CUDA-GDB as they are with GDB, by using the run command. This chapter describes how to interrupt and single-step CUDA applications.

6.1 Interrupting the Application
Darwin (the Apple branch). Now both versions of CUDA-GDB are using the same 7.2 source base. Now CUDA-GDB supports newer versions of GCC (tested up to GCC 4.5), has better support for DWARF3 debug information, and better C++ debugging support.

Simultaneous Sessions Support
With the 4.1 release, the single CUDA-GDB process restriction is lifted. Now, multiple CUDA-GDB sessions are allowed to co-exist, as long as the GPUs are not shared between the applications being processed. For instance, one CUDA-GDB process can debug process foo using GPU 0 while another CUDA-GDB process debugs process bar using GPU 1. The exclusive use of GPUs can be enforced with the CUDA_VISIBLE_DEVICES environment variable.

New Autostep Command
A new autostep command was added. The command increases the precision of CUDA exceptions by automatically single-stepping through portions of code. Under normal execution, the thread and instruction where an exception occurred may be imprecisely reported. However, the exact instruction that generates the exception can be determined if the program is being single-stepped when the exception occurs. Manually single-stepping through a program is a slow and tedious process. Therefore, autostep aids the user by allowing them to specify sections of code where they suspect an exception could occur. These sections are automatically single-stepped through when the
(table continued)
This occurs when a thread accesses a global address that is not correctly aligned.
This occurs when any thread in the warp hits a device-side assertion.

Chapter 11. CHECKING API ERRORS

CUDA-GDB can automatically check the return code of any driver API or runtime API call. If the return code indicates an error, the debugger will stop or warn the user. The behavior is controlled with the set cuda api_failures option. Three modes are supported:
- hide: will not report any error of any kind
- ignore: will emit a warning but continue the execution of the application (default)
- stop: will emit an error and stop the application

The success return code and other non-error return codes are ignored. For the driver API, those are CUDA_SUCCESS and CUDA_ERROR_NOT_READY. For the runtime API, they are cudaSuccess and cudaErrorNotReady.

Chapter 12. WALK-THROUGH EXAMPLES

The chapter contains three CUDA-GDB walk-through examples:
- Example 1: bitreverse
- Example 2: autostep
- Example 3: Debugging an MPI CUDA Application

12.1 Example 1: bitreverse

This section presents a walk-through of CUDA-GDB by debugging a sample application, called bitreverse, that performs a simple 8-bit reversal on a data set.

Source Code

    1  #include <stdio.h>
    2  #include <stdlib.h>
    3
    4  /* Simple 8-bit bit reversal Compute test */
be used:

    cuda-gdbserver :1234 --attach 5678

where 1234 is the TCP port number, and 5678 is the process identifier of the application cuda-gdbserver must be attached to.

When debugging a 32-bit application on a 64-bit server, cuda-gdbserver must also be 32-bit.

Launch cuda-gdb on the client

Configure cuda-gdb to connect to the remote target using either:

    (cuda-gdb) target remote

or:

    (cuda-gdb) target extended-remote

It is recommended to use the set sysroot command if libraries installed on the debug target might differ from the ones installed on the debug host. For example, cuda-gdb could be configured to connect to the remote target as follows:

    (cuda-gdb) target remote 192.168.0.2:1234
    (cuda-gdb) set sysroot remote://

where 192.168.0.2 is the IP address or domain name of the remote target, and 1234 is the TCP port previously opened by cuda-gdbserver.

3.4.7 Multiple Debuggers

In a multi-GPU environment, several debugging sessions may take place simultaneously, as long as the CUDA devices are used exclusively. For instance, one instance of CUDA-GDB can debug a first application that uses the first GPU while another instance of CUDA-GDB debugs a second application that uses the second GPU. The exclusive use of a GPU is achieved by specifying which GPU is visible to the application by using the CUDA_VISIBLE_DEVICES environment variable.
    ce 0]
    [Launch of CUDA Kernel 0 (example<<<(8,1,1),(64,1,1)>>>) on Device 0]

    Program received signal CUDA_EXCEPTION_10, Device Illegal Address.
    [Switching focus to CUDA kernel 0, grid 1, block (1,0,0), thread (0,0,0), device 0, sm 1, warp 0, lane 0]
    0x0000000000796f60 in example (data=0x200300000) at example.cu:17
    17  *data[idx1] = value3;

As expected, we received a CUDA_EXCEPTION_10. However, the reported thread is block 1, thread 0, and the line is 17. Since CUDA_EXCEPTION_10 is a global error, there is no thread information that is reported, so we would manually have to inspect all 512 threads.

2. Set autosteps. To get more accurate information, we reason that since CUDA_EXCEPTION_10 is a memory access error, it must occur on code that accesses memory. This happens on lines 11, 12, 16, 17, and 18, so we set two autostep windows for those areas:

    (cuda-gdb) autostep 11 for 2 lines
    Breakpoint 1 at 0x796d18: file example.cu, line 11.
    Created autostep of length 2 lines
    (cuda-gdb) autostep 16 for 3 lines
    Breakpoint 2 at 0x796e90: file example.cu, line 16.
    Created autostep of length 3 lines

3. Finally, we run the program again with these autosteps:

    (cuda-gdb) run
    The program being debugged has been started already.
    Start it from the beginning? (y or n) y
    [Termination of CUDA Kernel 0 (example<<<(8,1,1),(64,1,1)
    [nv2]$ cuda-gdb --pid 20060

For larger applications, in the case where you may just want to attach to a few of the processes, you can conditionalize the spin loop based on the rank. Most MPIs set an environment variable that is the rank of the process. For Open MPI it is OMPI_COMM_WORLD_RANK, and for MVAPICH it is MV2_COMM_WORLD_RANK. Assuming you want to attach to rank 42, you could add a spin loop like this:

    {
        char *stoprank;
        stoprank = getenv("OMPI_COMM_WORLD_RANK");
        if (42 == atoi(stoprank)) {
            int i = 0;
            char hostname[256];
            gethostname(hostname, sizeof(hostname));
            printf("PID %d on %s ready for attach\n", getpid(), hostname);
            fflush(stdout);
            while (0 == i)
                sleep(5);
        }
    }

Note that by default CUDA-GDB allows debugging a single process per node. The workaround described in Multiple Debuggers does not work with MPI applications. If CUDA_VISIBLE_DEVICES is set, it may cause problems with the GPU selection logic in the MPI application. It may also prevent CUDA IPC working between GPUs on a node. In order to start multiple CUDA-GDB sessions to debug individual MPI processes on the same node, use the --cuda-use-lockfile=0 option when starting CUDA-GDB, as described in Lock File. Each MPI process must guarantee it targets a unique GPU for this to work properly.

Chapter 13. ADVANCED SETTINGS

13.1 set cuda notify

Any time a CUDA event occurs, the debugger
-p option. To make it a permanent option, edit /System/Library/LaunchDaemons/com.apple.taskgated.plist. See man taskgated for more information. Here is an example:

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
      "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
    <plist version="1.0">
    <dict>
        <key>Label</key>
        <string>com.apple.taskgated</string>
        <key>MachServices</key>
        <dict>
            <key>com.apple.taskgated</key>
            <dict>
                <key>TaskSpecialPort</key>
                <integer>9</integer>
            </dict>
        </dict>
        <key>ProgramArguments</key>
        <array>
            <string>/usr/libexec/taskgated</string>
            <string>-p</string>
            <string>-s</string>
        </array>
    </dict>
    </plist>

After editing the file, the system must be rebooted, or the daemon stopped and relaunched, for the change to take effect.

Using the taskgated: as every application in the procmod group will have higher privileges, adding the -p option to the taskgated daemon is a possible security risk.

Debugging in the console mode: while debugging the application in console mode, it is not uncommon to encounter kernel warnings about unnesting DYLD shared regions for a debugger or a debugged process, that look as follows:

    cuda-binary-gdb (map: 0xffffff8038644658) triggered DYLD s
The message includes the kernel ID, the kernel name, and the device to which the kernel belongs:

    [Launch of CUDA Kernel 1 (kernel3) on Device 0]
    [Termination of CUDA Kernel 1 (kernel3) on Device 0]

No notification is sent for the kernels launched from the GPU.

The kernel event notification policy is controlled with the kernel_events option.
- (cuda-gdb) set cuda kernel_events on : CUDA-GDB displays the kernel events (default)
- (cuda-gdb) set cuda kernel_events off : CUDA-GDB does not display the kernel events

In addition to displaying kernel events, the underlying policy used to notify the debugger about kernel launches can be changed. By default, kernel launches cause events that CUDA-GDB will process. If the application launches a large number of kernels, it is preferable to defer sending kernel launch notifications until the time the debugger stops the application. At this time, only the kernel launch notifications for kernels that are valid on the stopped devices will be displayed. In this mode, the debugging session will run a lot faster. The deferral of such notifications can be controlled with the defer_kernel_launch_notifications option.
- (cuda-gdb) set cuda defer_kernel_launch_notifications off : CUDA-GDB receives events on kernel launches (default)
- (cuda-gdb) set cuda defer_kernel_launch_notifications on : CUDA-GDB defers receiving information
detect if a visible device is also used for display and return an error. To turn off the safeguard mechanism, the set cuda gpu_busy_check option should be set to off:

    (cuda-gdb) set cuda gpu_busy_check off

3.4.6 Remote Debugging

There are multiple methods to remote debug an application with CUDA-GDB. In addition to using SSH or VNC from the host system to connect to the target system, it is also possible to use the target remote GDB feature. Using this option, the local cuda-gdb (the client) connects to the cuda-gdbserver process (the server) running on the target system. This option is supported with a Linux or Mac OS X client and a Linux server. It is not possible to remotely debug a CUDA application running on Mac OS X.

Setting up remote debugging that way is a 2-step process:

Launch the cuda-gdbserver on the remote host

cuda-gdbserver can be launched on the remote host in different operation modes.

- Option 1: Launch a new application in debug mode. To launch a new application in debug mode, invoke cuda-gdbserver as follows:

    cuda-gdbserver :1234 app_invocation

  where 1234 is the TCP port number that cuda-gdbserver will listen to for incoming connections from cuda-gdb, and app_invocation is the invocation command to launch the application, arguments included.

- Option 2: Attach cuda-gdbserver to the running process. To attach cuda-gdbserver to an already running process, the --attach option followed by the process identification number (PID) must
simultaneous debugging of both GPU and CPU code within the same application. Just as programming in CUDA C is an extension to C programming, debugging with CUDA-GDB is a natural extension to debugging with GDB. The existing GDB debugging features are inherently present for debugging the host code, and additional features have been provided to support debugging CUDA device code.

CUDA-GDB supports C and C++ CUDA applications. All the C++ features supported by the NVCC compiler can be debugged by CUDA-GDB.

CUDA-GDB allows the user to set breakpoints, to single-step CUDA applications, and also to inspect and modify the memory and variables of any given thread running on the hardware.

CUDA-GDB supports debugging all CUDA applications, whether they use the CUDA driver API, the CUDA runtime API, or both.

CUDA-GDB supports debugging kernels that have been compiled for specific CUDA architectures, such as sm_10 or sm_20, but also supports debugging kernels compiled at runtime, referred to as just-in-time compilation, or JIT compilation for short.

1.3 About This Document

This document is the main documentation for CUDA-GDB and is organized more as a user manual than a reference manual. The rest of the document will describe how to install and use CUDA-GDB to debug CUDA kernels, and how to use the new CUDA commands that have been added to GDB. Some walk-through examples are also provided. It
where n is the ID of the kernel retrieved from info cuda kernels.

The same kernel can be loaded and used by different contexts and devices at the same time. When a breakpoint is set in such a kernel, by either name or file name and line number, it will be resolved arbitrarily to only one instance of that kernel. With the runtime API, the exact instance to which the breakpoint will be resolved cannot be controlled. With the driver API, the user can control the instance to which the breakpoint will be resolved by setting the breakpoint right after its module is loaded.

3.4.4 Multi-GPU Debugging in Console Mode

CUDA-GDB allows simultaneous debugging of applications running CUDA kernels on multiple GPUs. In console mode, CUDA-GDB can be used to pause and debug every GPU in the system. You can enable console mode as described above for the single-GPU console mode.

3.4.5 Multi-GPU Debugging with the Desktop Manager Running

This can be achieved by running the desktop GUI on one GPU and CUDA on the other GPU to avoid hanging the desktop GUI.

On Linux
The CUDA driver automatically excludes the GPU used by X11 from being visible to the application being debugged. This might alter the behavior of the application, since, if there are n GPUs in the system, then only n-1 GPUs will be visible to the application.

On Mac OS X
The CUDA driver exposes every CUDA-capable GPU in
error. In that situation, either increase the size of the window to make sure that the faulty instruction is included, or move the autostep location to an instruction that will be executed closer in time to the faulty instruction.

Autostep requires Fermi GPUs or above.

10.1.2 Related Commands

Autosteps and breakpoints share the same numbering, so most commands that work with breakpoints will also work with autosteps.

10.1.2.1 info autosteps
Shows all breakpoints and autosteps. Similar to info breakpoints.

    (cuda-gdb) info autosteps
    Num Type     Disp Enb Address            What
    1   autostep keep y   0x0000000000401234 in merge at sort.cu:30 for 49 instructions
    9   autostep keep y   0x0000000000489913 in bubble at sort.cu:94 for 11 lines

10.1.2.2 disable autosteps n
Disables an autostep. Equivalent to disable breakpoints n.

10.1.2.3 delete autosteps n
Deletes an autostep. Equivalent to delete breakpoints n.

10.1.2.4 ignore n i
Do not single-step the next i times the debugger enters the window for autostep n. This command already exists for breakpoints.

10.2 GPU Error Reporting

With improved GPU error reporting in CUDA-GDB, application bugs are now easier to identify and easy to fix. The following table shows the new errors that are reported on GPUs with compute capability sm_20 and higher. Continuing the execution of your application after these erro
few processes in a large application, CUDA-GDB can easily be used.

If the cluster nodes have xterm support, then it is quite easy to use CUDA-GDB. Just launch CUDA-GDB in the same way you would have launched gdb:

    mpirun -np 4 -host nv1,nv2 xterm -e cuda-gdb a.out

You may have to export the DISPLAY variable to make sure that the xterm finds its way back to your display. For example, with Open MPI you would do something like this:

    mpirun -np 4 -host nv1,nv2 -x DISPLAY=host.nvidia.com:0 xterm -e cuda-gdb a.out

Different MPI implementations have different ways of exporting environment variables to the cluster nodes, so check your documentation.

In the case where you cannot get xterm support, you can insert a spin loop inside your program. This works in just the same way as when using gdb on a host-only program. Somewhere near the start of your program, add a code snippet like the following:

    {
        int i = 0;
        char hostname[256];
        gethostname(hostname, sizeof(hostname));
        printf("PID %d on node %s is ready for attach\n", getpid(), hostname);
        fflush(stdout);
        while (0 == i)
            sleep(5);
    }

Then recompile and run the program. After it starts, ssh to the nodes of interest and attach to the process. Set the variable i to 1 to break out of the loop:

    mpirun -np 2 -host nv1,nv2 a.out
    PID 20060 on node nv1 is ready for attach
    PID 5488 on node nv2 is ready for attach

    [nv1]$ cuda-gdb --pid 5488
    [nv2]$ cuda-gdb --pi
- Tesla C870
- Tesla D870
- Tesla S870

Appendix B. KNOWN ISSUES

The following are known issues with the current release:
- Setting the cuda memcheck option ON will make all the launches blocking.
- Device memory allocated via cudaMalloc() is not visible outside of the kernel function.
- On GPUs with sm_type lower than sm_20, it is not possible to step over a subroutine in the device code.
- Requesting to read or write GPU memory may be unsuccessful if the size is larger than 100 MB on Tesla GPUs and larger than 32 MB on Fermi GPUs.
- On GPUs with sm_20, if you are debugging code in device functions that get called by multiple kernels, then setting a breakpoint in the device function will insert the breakpoint in only one of the kernels.
- In a multi-GPU debugging environment on Mac OS X with Aqua running, you may experience some visible delay while single-stepping the application.
- Setting a breakpoint on a line within a __device__ or __global__ function before its module is loaded may result in the breakpoint being temporarily set on the first line of a function below in the source code. As soon as the module for the targeted function is loaded, the breakpoint will be reset properly. In the meantime, the breakpoint may be hit, depending on the application. In those situations, the breakpoint can be safely ignored, and the application
shared region unnest for map: 0xffffff8038644bc8, region 0x7fff95e00000->0x7fff96000000. While not abnormal for debuggers, this increases system memory footprint until the target exits.

To prevent such messages from appearing, make sure that the vm.shared_region_unnest_logging kernel parameter is set to zero, for example, by using the following command:

    sudo sysctl -w vm.shared_region_unnest_logging=0

3.2.3 Temporary Directory

By default, CUDA-GDB uses /tmp as the directory to store temporary files. To select a different directory, set the TMPDIR environment variable.

The user must have write and execute permission to the temporary directory used by CUDA-GDB. Otherwise, the debugger will fail with an internal error.

3.3 Compiling the Application

3.3.1 Debug Compilation

NVCC, the NVIDIA CUDA compiler driver, provides a mechanism for generating the debugging information necessary for CUDA-GDB to work properly. The -g -G option pair must be passed to NVCC when an application is compiled in order to debug with CUDA-GDB; for example:

    nvcc -g -G foo.cu -o foo

Using this line to compile the CUDA application foo.cu:
- forces -O0 compilation, with the exception of very limited dead-code eliminations and register-spilling optimizations.
- makes the compiler include debug information in the executable.

3.3.2 Compiling For Specific GPU architectures

By default, the compiler will only generate PTX code for the compute_10 virtual
application. Those methods are described below.

The commands to set a breakpoint on the device code are the same as the commands used to set a breakpoint on the host code. If the breakpoint is set on device code, the breakpoint will be marked pending until the ELF image of the kernel is loaded. At that point, the breakpoint will be resolved and its address will be updated.

When a breakpoint is set, it forces all resident GPU threads to stop at this location when it hits that corresponding PC. When a breakpoint is hit by one thread, there is no guarantee that the other threads will hit the breakpoint at the same time. Therefore the same breakpoint may be hit several times, and the user must be careful with checking which thread(s) actually hit(s) the breakpoint.

7.1 Symbolic Breakpoints

To set a breakpoint at the entry of a function, use the break command followed by the name of the function or method:

    (cuda-gdb) break my_function
    (cuda-gdb) break my_class::my_method

For templatized functions and methods, the full signature must be given:

    (cuda-gdb) break int my_templatized_function<int>(int)

The mangled name of the function can also be used. To find the mangled name of a function, you can use the following command:

    (cuda-gdb) set demangle-style none
    (cuda-gdb) info function my_function_name
    (cuda-gdb) set demangle-style auto
8.4. Info CUDA Commands .......................................... 25
8.4.1. info cuda devices .......................................... 26
8.4.2. info cuda sms .............................................. 26
8.4.3. info cuda warps ............................................ 27
8.4.4. info cuda lanes ............................................ 27
8.4.5. info cuda kernels .......................................... 27
8.4.6. info cuda blocks ........................................... 28
8.4.7. info cuda threads .......................................... 28
8.4.8. info cuda launch trace ..................................... 29
8.4.9. info cuda launch children .................................. 30
8.4.10. info cuda contexts ........................................ 30
8.5. Disassembly .................................................. 30
Chapter 9. Event Notifications .................................... 32
9.1. Context Events ............................................... 32
9.2. Kernel Events ................................................
architecture. Then, at runtime, the kernels are recompiled for the GPU architecture of the target GPU(s).

Compiling for a specific virtual architecture guarantees that the application will work for any GPU architecture after that, for a trade-off in performance. This is done for forward compatibility.

It is highly recommended to compile the application once and for all for the GPU architectures targeted by the application, and to generate the PTX code for the latest virtual architecture for forward compatibility.

A GPU architecture is defined by its compute capability. For the list of GPUs and their respective compute capability, see https://developer.nvidia.com/cuda-gpus. The same application can be compiled for multiple GPU architectures. Use the -gencode compilation option to dictate which GPU architecture to compile for. The option can be specified multiple times.

For instance, to compile an application for a GPU with compute capability 3.0, add the following flag to the compilation command:

    -gencode arch=compute_30,code=sm_30

To compile PTX code for any future architecture past the compute capability 3.5, add the following flag to the compilation command:

    -gencode arch=compute_35,code=compute_35

For additional information, please consult the compiler documentation at http://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#extended-notation

3.4
It is also possible to detach from the application before letting it run to completion. When attached, all the usual features of the debugger are available to the user, as if the application had been launched from the debugger. This feature is also supported with applications using Dynamic Parallelism.

Attach on exception
Using the environment variable CUDA_DEVICE_WAITS_ON_EXCEPTION, the application will run normally until a device exception occurs. Then the application will wait for the debugger to attach itself to it for further debugging.

API Error Reporting
Checking the error code of all the CUDA driver API and CUDA runtime API function calls is vital to ensure the correctness of a CUDA application. Now the debugger is able to report, and even stop, when any API call returns an error. See set cuda api_failures for more information.

Inlined Subroutine Support
Inlined subroutines are now accessible from the debugger on SM 2.0 and above. The user can inspect the local variables of those subroutines and visit the call frame stack as if the routines were not inlined.

4.2 Release
Kepler Support
The primary change in Release 4.2 of CUDA-GDB is the addition of support for the new Kepler architecture. There are no other user-visible changes in this release.

4.1 Release
Source Base Upgraded to GDB 7.2
Until now, CUDA-GDB was based on GDB 6.6 on Linux, and GDB 6.3.5 on D
marked as Terminated. For the few cases when the debugger cannot determine if a kernel is pending or terminated, the status is set to Undetermined. This command supports filters, and the default is "kernel all".

8.4.9 info cuda launch children

This command displays the list of non-terminated kernels launched by the kernel in focus. For each kernel, the kernel ID, the device ID, the grid ID, the kernel dimensions, the kernel name, and the kernel parameters are displayed.

    (cuda-gdb) info cuda launch children
      Kernel Dev Grid GridDim  BlockDim Invocation
    *      3   0   -7 (1,1,1)  (1,1,1)  kernel5(a=3)
          18   0   -8 (1,1,1)  (32,1,1) kernel4(b=5)

This command supports filters, and the default is "kernel all".

8.4.10 info cuda contexts

This command enumerates all the CUDA contexts running on all GPUs. A * indicates the context currently in focus. This command shows whether a context is currently active on a device or not.

    (cuda-gdb) info cuda contexts
      Context    Dev State
      0x080b9518 0   inactive
    * 0x08067948 0   active

8.5 Disassembly

The device SASS code can be disassembled using the standard GDB disassembly instructions, such as x/i and display/i.

    (cuda-gdb) x/4i $pc
    => 0x7a5cf0 <_Z9foo10Params+752>: IMUL R2, R0, R3
       0x7a5cf8 <_Z9foo10Params+760>: MOV R3, R4
       0x7a5d00 <_Z9foo10Params+768>: IMUL R0, R0, R3
       0x7a5d08 <_Z9foo10Params+776>: IADD R18, R0, R3
7.2 Line Breakpoints

To set a breakpoint on a specific line number, use the following syntax:

(cuda-gdb) break my_file.cu:185

If the specified line corresponds to an instruction within templatized code, multiple breakpoints will be created, one for each instance of the templatized code.

7.3 Address Breakpoints

To set a breakpoint at a specific address, use the break command with the address as argument:

(cuda-gdb) break *0x1afe34d0

The address can be any address on the device or the host.

7.4 Kernel Entry Breakpoints

To break on the first instruction of every launched kernel, set the break_on_launch option to application:

(cuda-gdb) set cuda break_on_launch application

Possible options are:
- application: kernel launched by the user application
- system: any kernel launched by the driver, such as memset
- all: any kernel, application and system
- none: no kernel, application or system

Those automatic breakpoints are not displayed by the info breakpoints command and are managed separately from individual breakpoints. Turning off the option will not delete other individual breakpoints set to the same address, and vice versa.

Setting the break_on_launch option to any value other than none will force the kernel_events option to be set to show.

7.5 Conditional Breakpoints

To make the breakpoint conditional, use the optional if keyword or the cond command:

(cuda-gdb) break foo.cu:23 if threadIdx.x == 1 && i < 5
(cuda-gdb) cond 3 threadIdx.x == 1 && i < 5
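As an illustration, the conditional breakpoint above could target a kernel like the following sketch (the file name foo.cu, the loop variable i, and the line placement are assumptions chosen to match the example commands, not code from this manual):

```cuda
// foo.cu -- hypothetical kernel matching the example command
//   (cuda-gdb) break foo.cu:23 if threadIdx.x == 1 && i < 5
__global__ void foo(int *data, int n) {
    for (int i = 0; i < n; i++) {
        // Assume this statement sits at line 23 of foo.cu: the
        // breakpoint is hit by every thread, but the debugger reports
        // it only for thread 1 while i < 5.
        data[threadIdx.x] += i;
    }
}
```

Because every thread still evaluates the condition on each hit, a loop like this one is exactly the situation where a conditional breakpoint can noticeably slow the debugging session.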
blocks: information about all the active blocks in the current kernel
threads: information about all the active threads in the current kernel
launch trace: information about the parent kernels of the kernel in focus
launch children: information about the kernels launched by the kernels in focus
contexts: information about all the contexts

A filter can be applied to every info cuda command. The filter restricts the scope of the command. A filter is composed of one or more restrictions. A restriction can be any of the following:

- device n
- sm n
- warp n
- lane n
- kernel n
- grid n
- block (x[,y]) or block x[,y]
- thread (x[,y[,z]]) or thread x[,y[,z]]
- breakpoint all and breakpoint n

where n, x, y, z are integers, or one of the following special keywords: current, any, and all. current indicates that the corresponding value in the current focus should be used. any and all indicate that any value is acceptable. The breakpoint all and breakpoint n filters are only effective for the info cuda threads command.

8.4.1 info cuda devices

This command enumerates all the GPUs in the system, sorted by device index. A * indicates the device currently in focus. This command supports filters. The default is device all. This command prints "No CUDA Devices" if no GPUs are found.

(cuda-gdb) info cuda devices
  Dev  Description  SM Type  SMs
commands will apply only to that host thread, unless the application is fully resumed, for instance. On the device side, the focus is always set to the lowest granularity level: the device thread.

5.1 Software Coordinates vs. Hardware Coordinates

A device thread belongs to a block, which in turn belongs to a kernel. Thread, block, and kernel are the software coordinates of the focus. A device thread runs on a lane. A lane belongs to a warp, which belongs to an SM, which in turn belongs to a device. Lane, warp, SM, and device are the hardware coordinates of the focus. Software and hardware coordinates can be used interchangeably and simultaneously, as long as they remain coherent.

Another software coordinate is sometimes used: the grid. The difference between a grid and a kernel is the scope. The grid ID is unique per GPU, whereas the kernel ID is unique across all GPUs. Therefore there is a 1:1 mapping between a kernel and a (grid, device) tuple.

Note: If software preemption is enabled (set cuda software_preemption on), the hardware coordinates corresponding to a device thread are likely to change upon resuming execution on the device. However, software coordinates will remain intact and will not change for the lifetime of the device thread.

5.2 Current Focus

To inspect the current focus, use the cuda command followed by the coordinates of interest:

(cuda-gdb) cuda device sm warp lane block thread
5.2 Current Focus
5.3 Switching Focus
Chapter 6. Program Execution
6.1 Interrupting the Application
6.2 Single Stepping
Chapter 7. Breakpoints & Watchpoints
7.1 Symbolic Breakpoints
7.2 Line Breakpoints
7.3 Address Breakpoints
7.4 Kernel Entry Breakpoints
7.5 Conditional Breakpoints
7.6 Watchpoints
Chapter 8. Inspecting Program State
8.1 Memory and Variables
8.2 Variable Storage and Accessibility
8.3 Inspecting Textures
information about kernel launches. If the break_on_launch option is set to any value other than none, the deferred kernel launch notifications are disabled.

Chapter 10. CHECKING MEMORY ERRORS

The CUDA memcheck feature detects global memory violations and misaligned global memory accesses. This feature is off by default and can be enabled using the following variable in CUDA-GDB before the application is run:

(cuda-gdb) set cuda memcheck on

Once CUDA memcheck is enabled, any detection of global memory violations and misaligned global memory accesses will be reported.

When CUDA memcheck is enabled, all the kernel launches are made blocking, as if the environment variable CUDA_LAUNCH_BLOCKING was set to 1. The host thread launching a kernel will therefore wait until the kernel has completed before proceeding. This may change the behavior of your application.

You can also run the CUDA memory checker as a standalone tool named CUDA-MEMCHECK. This tool is also part of the toolkit. Please read the related documentation for more information.

By default, CUDA-GDB will report any memory error. See Increasing the Precision of Memory Errors With Autostep for a list of the memory errors. To increase the number of memory errors being reported and to increase the precision of the memory errors, CUDA memcheck must be turned on.

10.1 Increasing the Precision of Memory Errors With Autostep
6.1 Interrupting the Application

If the CUDA application appears to be hanging or stuck in an infinite loop, it is possible to manually interrupt the application by pressing CTRL+C. When the signal is received, the GPUs are suspended and the cuda-gdb prompt will appear. At that point, the program can be inspected, modified, single-stepped, resumed, or terminated at the user's discretion.

This feature is limited to applications running within the debugger. It is not possible to break into and debug applications that have been launched outside the debugger.

6.2 Single Stepping

Single-stepping device code is supported. However, unlike host code, single-stepping device code works at the warp level. This means that single-stepping a device kernel advances all the active threads in the warp currently in focus. The divergent threads in the warp are not single-stepped.

In order to advance the execution of more than one warp, a breakpoint must be set at the desired location, and then the application must be fully resumed.

A special case is single-stepping over a thread barrier call: __syncthreads(). In this case, an implicit temporary breakpoint is set immediately after the barrier, and all threads are resumed until the temporary breakpoint is hit.

On GPUs with sm_type lower than sm_20, it is not possible to step over a subroutine in the device code. Instead, CUDA-GDB always steps into the device function. On GPUs with sm_type sm_20 and higher, you can step in, over, or out of the device functions, as long as they are not inlined.
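The barrier special case can be pictured with a small kernel sketch (our own illustration, not from the manual): stepping over the __syncthreads() call below resumes all threads in the block until the implicit temporary breakpoint immediately after the barrier is hit.

```cuda
// Illustrative kernel: reverses a block of data through shared memory.
__global__ void reverse_shared(int *data) {
    __shared__ int tile[256];
    tile[threadIdx.x] = data[threadIdx.x];
    __syncthreads();  // stepping over this barrier resumes ALL threads,
                      // not just the warp in focus, up to the next line
    data[threadIdx.x] = tile[blockDim.x - 1 - threadIdx.x];
}
```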
the render GPU will be enumerated correctly.

1. Launch your X session in non-interactive mode.
   a. Stop your X server.
   b. Edit /etc/X11/xorg.conf to contain the following line in the Device section corresponding to your display:
      Option "Interactive" "off"
   c. Restart your X server.
2. Log in remotely (SSH, etc.) and launch your application under CUDA-GDB. This setup works properly for single-GPU and multi-GPU configurations.
3. Ensure your DISPLAY environment variable is set appropriately. For example:
   export DISPLAY=:0.0

While X is in non-interactive mode, interacting with the X session can cause your debugging session to stall or terminate.

Chapter 4. CUDA-GDB EXTENSIONS

4.1 Command Naming Convention

The existing GDB commands are unchanged. Every new CUDA command or option is prefixed with the CUDA keyword. As much as possible, CUDA-GDB command names will be similar to the equivalent GDB commands used for debugging host code. For instance, the GDB commands to display the host threads and switch to host thread 1 are, respectively:

(cuda-gdb) info threads
(cuda-gdb) thread 1

To display the CUDA threads and switch to CUDA thread 1, the user only has to type:

(cuda-gdb) info cuda threads
(cuda-gdb) cuda thread 1

4.2 Getting Help

As with GDB commands, the built-in help for the CUDA commands is accessible from the cuda-gdb command line by using the help command.
Chapter 13. Advanced Settings
13.1 set cuda notify
Appendix A. Supported Platforms
A.1 Host Platform Requirements
Appendix B. Known Issues

List of Tables
Table 1. CUDA Exception Codes

Chapter 1. INTRODUCTION

This document introduces CUDA-GDB, the NVIDIA CUDA debugger for Linux and Mac OS.

1.1 What is CUDA-GDB?

CUDA-GDB is the NVIDIA tool for debugging CUDA applications running on Linux and Mac. CUDA-GDB is an extension to the x86-64 port of GDB, the GNU Project debugger. The tool provides developers with a mechanism for debugging CUDA applications running on actual hardware. This enables developers to debug applications without the potential variations introduced by simulation and emulation environments.

CUDA-GDB runs on Linux and Mac OS X, 32-bit and 64-bit. CUDA-GDB is based on GDB 7.2 on both Linux and Mac OS X.

1.2 Supported Features

CUDA-GDB is designed to present the user with a seamless debugging environment that allows simultaneous
or an instruction address preceded by an asterisk. If no LOCATION is specified, then the current instruction address is used.
- LENGTH specifies the size of the autostep window, in number of lines or instructions (lines and instructions can be shortened, e.g., l or i). If the length type is not specified, then lines is the default. If the for clause is omitted, then the default is 1 line.
- astep can be used as an alias for the autostep command.
- Calls to functions made during an autostep will be stepped over.
- In case of divergence, the length of the autostep window is determined by the number of lines or instructions the first active lane in each warp executes. Divergent lanes are also single-stepped, but the instructions they execute do not count towards the length of the autostep window.
- If a breakpoint occurs while inside an autostep window, the warp where the breakpoint was hit will not continue autostepping when the program is resumed. However, other warps may continue autostepping.
- Overlapping autosteps are not supported. If an autostep is encountered while another autostep is being executed, then the second autostep is ignored.

If an autostep is set before the location of a memory error and no memory error is hit, then it is possible that the chosen window is too small. This may be caused by the presence of function calls between the address of the autostep location and the instruction that triggers the memory error.
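As an illustration of where an autostep window is typically placed (our own sketch; the file name, line numbers, and kernel are assumptions, not from this manual), consider a kernel with a suspect indexed access:

```cuda
// scale.cu -- hypothetical kernel with a potentially out-of-bounds read.
__global__ void scale(int *buf, int n, int stride) {
    int idx = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    // Suspect region: if stride is wrong, buf[idx] may be out of bounds.
    // One might cover it with an autostep window, e.g.
    //   (cuda-gdb) autostep scale.cu:8 for 3 lines
    // so any exception here is caught at the exact lane and instruction.
    int v = buf[idx];
    buf[idx] = 2 * v;
}
```

Only this window is single-stepped; the rest of the kernel runs at full speed, which is the whole point of autostep over manual stepping.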
kernels on the GPU in focus. It prints the SM mask, kernel ID, and grid ID for each kernel, with the associated dimensions and arguments. The kernel ID is unique across all GPUs, whereas the grid ID is unique per GPU. The Parent column shows the kernel ID of the parent grid. This command supports filters, and the default is kernel all.

(cuda-gdb) info cuda kernels
  Kernel Parent Dev Grid Status   SMs Mask   GridDim   BlockDim  Name      Args
*      1      -   0    2 Active 0x00ffffff (240,1,1) (128,1,1) acos_main parms=...

This command will also show grids that have been launched on the GPU with Dynamic Parallelism. Kernels with a negative grid ID have been launched from the GPU, while kernels with a positive grid ID have been launched from the CPU. With the cudaDeviceSynchronize routine, it is possible to see grid launches disappear from the device and then resume later, after all child launches have completed.

8.4.6 info cuda blocks

This command displays all the active or running blocks for the kernel in focus. The results are grouped per kernel. This command supports filters, and the default is kernel current block all. The outputs are coalesced by default.

(cuda-gdb) info cuda blocks
  BlockIdx  To BlockIdx  Count  State
Kernel 1
*  (0,0,0)    (191,0,0)    192  running

Coalescing can be turned off as follows, in which case more information on the Device and the SM get displayed:

(cuda-gdb) set cuda coalescing off
errors are found can lead to application termination or indeterminate results.

Table 1. CUDA Exception Codes

CUDA_EXCEPTION_0: "Device Unknown Exception"
  Precision of the Error: Not precise. Scope of the Error: Global error on the GPU.
  Description: This is a global GPU error caused by the application which does not match any of the listed error codes below. This should be a rare occurrence. Potentially, this may be due to Device Hardware Stack overflows or a kernel generating an exception very close to its termination.

CUDA_EXCEPTION_1: "Lane Illegal Address"
  Precision: Precise (requires memcheck on). Scope: Per-lane (thread) error.
  Description: This occurs when a thread accesses an illegal (out of bounds) global address.

CUDA_EXCEPTION_2: "Lane User Stack Overflow"
  Precision: Precise. Scope: Per-lane (thread) error.
  Description: This occurs when a thread exceeds its stack memory limit.

CUDA_EXCEPTION_3: "Device Hardware Stack Overflow"
  Precision: Not precise. Scope: Global error on the GPU.
  Description: This occurs when the application triggers a global hardware stack overflow.

CUDA_EXCEPTION_4: "Warp Illegal Instruction"
  Precision: Not precise. Scope: Warp error.

CUDA_EXCEPTION_5: "Warp Out-of-range Address"
  Precision: Not precise. Scope: Warp error.

CUDA_EXCEPTION_6: "Warp Misaligned Address"
  Precision: Not precise. Scope: Warp error.

CUDA_EXCEPTION_7: "Warp Invalid Address Space"
  Precision: Not precise. Scope: Warp error.

CUDA_EXCEPTION_8: "Warp Invalid PC"
  Precision: Not precise. Scope: Warp error.
- Use the following command:
  (cuda-gdb) set cuda software_preemption on
- Export the following environment variable:
  CUDA_DEBUGGER_SOFTWARE_PREEMPTION=1

Either of the options above will activate software preemption. These options must be set prior to running the application. When the GPU hits a breakpoint or any other event that would normally cause the GPU to freeze, CUDA-GDB releases the GPU for use by the desktop or other applications. This enables CUDA-GDB to debug a CUDA application on the same GPU that is running the desktop GUI, and also enables debugging of multiple CUDA applications context-switching on the same GPU. The options listed above are ignored for GPUs with less than SM 3.5 compute capability.

3.4.3 Multi-GPU Debugging

Multi-GPU debugging designates the scenario where the application is running on more than one CUDA-capable device. Multi-GPU debugging is not much different than single-GPU debugging, except for a few additional CUDA-GDB commands that let you switch between the GPUs.

Any GPU hitting a breakpoint will pause all the GPUs running CUDA on that system. Once paused, you can use info cuda kernels to view all the active kernels and the GPUs they are running on. When any GPU is resumed, all the GPUs are resumed. If the CUDA_VISIBLE_DEVICES environment variable is used, only the specified devices are suspended and resumed.

All CUDA-capable GPUs may run one or more kernels. To switch to an active kernel, use:

(cuda-gdb) cuda kernel n

where n is the ID of the kernel.
the system, including the one used by the Aqua desktop manager. To determine which GPU should be used for CUDA, run the 1_Utilities/deviceQuery CUDA sample. A truncated example output of deviceQuery is shown below:

Detected 2 CUDA Capable device(s)

Device 0: "GeForce GT 330M"
  CUDA Driver Version / Runtime Version:        5.5 / 5.5
  CUDA Capability Major/Minor version number:   1.2
  Total amount of global memory:                512 MBytes (536543232 bytes)
  ( 6) Multiprocessors x (  8) CUDA Cores/MP:   48 CUDA Cores
  [...]

Device 1: "Quadro K5000"
  CUDA Driver Version / Runtime Version:        5.5 / 5.5
  CUDA Capability Major/Minor version number:   3.0
  Total amount of global memory:                4096 MBytes (4294508544 bytes)
  ( 8) Multiprocessors x (192) CUDA Cores/MP:   1536 CUDA Cores
  [...]

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 5.5, CUDA Runtime Version = 5.5, NumDevs = 2, Device0 = GeForce GT 330M, Device1 = Quadro K5000

If Device 0 is rendering the desktop, then Device 1 must be selected for running and debugging the CUDA application. This exclusion of a device can be achieved by setting the CUDA_VISIBLE_DEVICES environment variable to the index of the device that will be used for CUDA. In this particular example, the value would be 1:

export CUDA_VISIBLE_DEVICES=1

As a safeguard mechanism, cuda-gdb will detect
You can step in, over, or out of the device functions, as long as they are not inlined. To force a function to not be inlined by the compiler, the __noinline__ keyword must be added to the function declaration.

With Dynamic Parallelism on sm_35, several CUDA APIs can now be instantiated from the device. The following list defines single-step behavior when encountering these APIs:

- When encountering device-side kernel launches (denoted by the <<<>>> launch syntax), the step and next commands will have the same behavior, and both will step over the launch call.
- When encountering cudaDeviceSynchronize, the launch synchronization routine, the step and next commands will have the same behavior, and both will step over the call. When stepping over the call, the entire device is resumed until the call has completed, at which point the device is suspended, without user intervention.
- When stepping a device grid launch to completion, focus will automatically switch back to the CPU. The cuda kernel focus-switching command must be used to switch to another grid of interest, if one is still resident.
- It is not possible to step into a device launch call, nor the routine launched by the call.

Chapter 7. BREAKPOINTS & WATCHPOINTS

There are multiple ways to set a breakpoint on a CUDA application.
assumes that the source file name is bitreverse.cu and that no additional compiler flags are required for compilation. (See also Debug Compilation.)

2. Start the CUDA debugger by entering the following command at a shell prompt:
   cuda-gdb bitreverse

3. Set breakpoints. Set both the host (main) and GPU (bitreverse) breakpoints here. Also, set a breakpoint at a particular line in the device function (bitreverse.cu:18):
   (cuda-gdb) break main
   Breakpoint 1 at 0x18e1: file bitreverse.cu, line 25.
   (cuda-gdb) break bitreverse
   Breakpoint 2 at 0x18a1: file bitreverse.cu, line 8.
   (cuda-gdb) break 21
   Breakpoint 3 at 0x18ac: file bitreverse.cu, line 21.

4. Run the CUDA application, and it executes until it reaches the first breakpoint (main) set in 3:
   (cuda-gdb) run
   Starting program: /Users/CUDA_User1/docs/bitreverse
   Reading symbols for shared libraries
   Breakpoint 1, main () at bitreverse.cu:25
   25        void *d = NULL; int i;

5. At this point, commands can be entered to advance execution or to print the program state. For this walkthrough, let's continue until the device kernel is launched:
   (cuda-gdb) continue
   Continuing.
   Reading symbols for shared libraries .. done
   Reading symbols for shared libraries .. done
   [Context Create of context 0x80f200 on Device 0]
   [Launch of CUDA Kernel 0 (bitreverse<<<(1,1,1),(256,1,1)>>>) on Device 0]
   Breakpoint 3 at 0x8667b8: file bitreverse.cu, line 21.
This occurs when the application triggers a global hardware stack overflow. The main cause of this error is large amounts of divergence in the presence of function calls. (CUDA_EXCEPTION_3)

This occurs when any thread within a warp has executed an illegal instruction. (CUDA_EXCEPTION_4)

This occurs when any thread within a warp accesses an address that is outside the valid range of local or shared memory regions. (CUDA_EXCEPTION_5)

This occurs when any thread within a warp accesses an address in the local or shared memory segments that is not correctly aligned. (CUDA_EXCEPTION_6)

This occurs when any thread within a warp executes an instruction that accesses a memory space not permitted for that instruction. (CUDA_EXCEPTION_7)

This occurs when any thread within a warp advances its PC beyond the 40-bit address space. (CUDA_EXCEPTION_8)

CUDA_EXCEPTION_9: "Warp Hardware Stack Overflow"
  Precision: Not precise. Scope: Warp error.
  Description: This occurs when any thread in a warp triggers a hardware stack overflow. This should be a rare occurrence.

CUDA_EXCEPTION_10: "Device Illegal Address"
  Precision: Not precise. Scope: Global error.
  Description: This occurs when a thread accesses an illegal (out of bounds) global address. For increased precision, use the cuda memcheck feature.

CUDA_EXCEPTION_11: "Lane Misaligned Address"
  Precision: Precise (requires memcheck on). Scope: Per-lane (thread) error.

CUDA_EXCEPTION_12: "Warp Assert"
  Precision: Precise. Scope: Per-warp.
5
6   #define N 256
7
8   __global__ void bitreverse(void *data) {
9       unsigned int *idata = (unsigned int*)data;
10      extern __shared__ int array[];
11
12      array[threadIdx.x] = idata[threadIdx.x];
13
14      array[threadIdx.x] = ((0xf0f0f0f0 & array[threadIdx.x]) >> 4) |
15                           ((0x0f0f0f0f & array[threadIdx.x]) << 4);
16      array[threadIdx.x] = ((0xcccccccc & array[threadIdx.x]) >> 2) |
17                           ((0x33333333 & array[threadIdx.x]) << 2);
18      array[threadIdx.x] = ((0xaaaaaaaa & array[threadIdx.x]) >> 1) |
19                           ((0x55555555 & array[threadIdx.x]) << 1);
20
21      idata[threadIdx.x] = array[threadIdx.x];
22  }
23
24  int main(void) {
25      void *d = NULL; int i;
26      unsigned int idata[N], odata[N];
27
28      for (i = 0; i < N; i++)
29          idata[i] = (unsigned int)i;
30
31      cudaMalloc((void**)&d, sizeof(int)*N);
32      cudaMemcpy(d, idata, sizeof(int)*N,
33                 cudaMemcpyHostToDevice);
34
35      bitreverse<<<1, N, N*sizeof(int)>>>(d);
36
37      cudaMemcpy(odata, d, sizeof(int)*N,
38                 cudaMemcpyDeviceToHost);
39
40      for (i = 0; i < N; i++)
41          printf("%u -> %u\n", idata[i], odata[i]);
42
43      cudaFree((void*)d);
44      return 0;
45  }

12.1.1 Walking through the Code

1. Begin by compiling the bitreverse.cu CUDA application for debugging by entering the following command at a shell prompt:
   nvcc -g -G bitreverse.cu -o bitreverse

This command
Autostep is a command to increase the precision of CUDA exceptions to the exact lane and instruction, when they would not have been otherwise.

Under normal execution, an exception may be reported several instructions after the exception occurred, or the exact thread where an exception occurred may not be known unless the exception is a lane error. However, the precise origin of the exception can be determined if the program is being single-stepped when the exception occurs. Single-stepping manually is a slow and tedious process: stepping takes much longer than normal execution, and the user has to single-step each warp individually.

Autostep aids the user by allowing them to specify sections of code where they suspect an exception could occur, and these sections are automatically and transparently single-stepped while the program is running. The rest of the program is executed normally, to minimize the slow-down caused by single-stepping. The precise origin of an exception will be reported if the exception occurs within these sections. Thus the exact instruction and thread where an exception occurred can be found quickly and with much less effort by using autostep.

10.1.1 Usage

autostep [LOCATION]
autostep [LOCATION] for LENGTH [lines|instructions]

- LOCATION may be anything that you use to specify the location of a breakpoint, such as a line number, a function name, or
