
CUDA-GDB Debugger User Manual


Contents

Single GPU Debugging

In a single-GPU system, CUDA-GDB can be used to debug CUDA applications only if no X11 server (on Linux) or no Aqua desktop manager (on Mac OS X) is running on that system. On Linux you can stop the X11 server by stopping the gdm service. On Mac OS X you can log in with ">console" as the user name in the desktop UI login screen. This allows CUDA applications to be executed and debugged in a single-GPU configuration.

Multi-GPU Debugging

Multi-GPU debugging is not much different from single-GPU debugging, except for a few additional CUDA-GDB commands that let you switch between the GPUs. Any GPU hitting a breakpoint will pause all the GPUs running CUDA on that system. Once paused, you can use info cuda kernels to view all the active kernels and the GPUs they are running on. When any GPU is resumed, all the GPUs are resumed.

Note: If the CUDA_VISIBLE_DEVICES environment variable is used, only the specified devices are suspended and resumed.

All CUDA-capable GPUs may each run one or more kernels. To switch to an active kernel, use cuda kernel <n>, where n is the ID of the kernel retrieved from info cuda kernels.

Note: The same kernel can be loaded and used by different contexts and devices at the same time. When a breakpoint is set in such a kernel, by either name or file name and line number, it will be resolved arbitrarily to only one instance of that kernel. With the runtime API, the exact instance to which the break
(64-bit only) and 6.1 (64-bit only)
> Ubuntu 10.04, 10.10, and 11.04
> Fedora 13 and 14
> OpenSUSE 11.2
> SUSE Linux Enterprise Server 11.1

CUDA-GDB DU-05227-001_V4.1

Appendix A. Supported Platforms

GPU Requirements

Debugging is supported on all CUDA-capable GPUs with a compute capability of 1.1 or later. Compute capability is a device attribute that a CUDA application can query; for more information, see the latest NVIDIA CUDA Programming Guide on the NVIDIA CUDA Zone web site: http://developer.nvidia.com/object/gpucomputing.html

These GPUs have a compute capability of 1.0 and are not supported:
GeForce 8800 GTS, GeForce 8800 GTX, GeForce 8800 Ultra, Quadro Plex 1000 Model IV, Quadro Plex 2100 Model S4, Quadro FX 4600, Quadro FX 5600, Tesla C870, Tesla D870, Tesla S870

Known Issues

The following are known issues with the current release:
> Setting the cuda memcheck option ON will make all the launches blocking.
> Conditional breakpoints can only be set after the CUDA module is loaded.
> Device memory allocated via cudaMalloc() is not visible outside of the kernel function.
> On GPUs with sm_type lower than sm_20, it is not possible to step over a subroutine in the device code.
> Requesting to read or write GPU memory may be unsuccessful if the size is larger than 100 MB on Tesla GPUs and larger than 32 MB on Fermi GPUs.
> On GPUs with s
  BlockIdx ThreadIdx  To BlockIdx ThreadIdx  Count  Virtual PC          Filename       Line
Kernel 0
* (0,0,0)  (0,0,0)    (0,0,0)   (255,0,0)   256    0x0000000000866400  bitreverse.cu  9
(cuda-gdb) thread
[Current thread is 1 (process 16738)]
(cuda-gdb) thread 1
[Switching to thread 1 (process 16738)]
#0 0x000019d5 in main () at bitreverse.cu:34
34 bitreverse<<<1, N, N*sizeof(int)>>>(d);
(cuda-gdb) backtrace
#0 0x000019d5 in main () at bitreverse.cu:34
(cuda-gdb) info cuda kernels
Kernel  Dev  Grid  SMs Mask    GridDim  BlockDim   Name        Args
0       0    1     0x00000001  (1,1,1)  (256,1,1)  bitreverse  data=0x110000

(cuda-gdb) cuda kernel 0
[Switching focus to CUDA kernel 0, grid 1, block (0,0,0), thread (0,0,0), device 0, sm 0, warp 0, lane 0]
9 unsigned int *idata = (unsigned int *)data;
(cuda-gdb) backtrace
#0 bitreverse<<<(1,1,1),(256,1,1)>>> (data=0x110000) at bitreverse.cu:9

7. Corroborate this information by printing the block and thread indexes:
(cuda-gdb) print blockIdx
$1 = {x = 0, y = 0}
(cuda-gdb) print threadIdx
$2 = {x = 0, y = 0, z = 0}

8. The grid and block dimensions can also be printed:
(cuda-gdb) print gridDim
$3 = {x = 1, y = 1}
(cuda-gdb) print blockDim
$4 = {x = 256, y = 1, z = 1}

9. Advance kernel execution and verify some data:
(cuda-gdb) next
12 array[threadIdx.x] = idata[threadIdx.x];
(cuda-gdb) next
14 arra
If an autostep is encountered while another autostep is being executed, then the second autostep is ignored.

Note: Autostep requires Fermi GPUs or above.

Related Commands

Autosteps and breakpoints share the same numbering, so most commands that work with breakpoints will also work with autosteps.

info autosteps
Shows all breakpoints and autosteps. Similar to info breakpoints.
(cuda-gdb) info autosteps
Num  Type      Disp  Enb  Address             What
1    autostep  keep  y    0x0000000000401234  in merge at sort.cu:30 for 49 instructions
3    autostep  keep  y    0x0000000000489913  in bubble at sort.cu:94 for 11 lines

disable autosteps n
Disables an autostep. Equivalent to disable breakpoints n.

delete autosteps n
Deletes an autostep. Equivalent to delete breakpoints n.

ignore n i
Do not single-step the next i times the debugger enters the window for autostep n. This command already exists for breakpoints.

GPU Error Reporting

With improved GPU error reporting in CUDA-GDB, application bugs are now easier to identify and easy to fix. The following table shows the new errors that are reported on GPUs with compute capability sm_20 and higher.

Note: Continuing the execution of your application after these errors are found can lead to indeterminate results.

Table 10.1 CUDA Exception Codes
block (1,0,0), thread (0,0,0), device 0, sm 1, warp 0, lane 0]
0x0000000000796f60 in example (data=0x200300000) at example.cu:17
17 *(data[idx1]) = value3;

As expected, we received a CUDA_EXCEPTION_10. However, the reported thread is block 1, thread 0, and the line is 17. Since CUDA_EXCEPTION_10 is a global error, there is no thread information that is reported, so we would manually have to inspect all 512 threads.

2. Set autosteps. To get more accurate information, we reason that since CUDA_EXCEPTION_10 is a memory access error, it must occur on code that accesses memory. This happens on lines 11, 12, 16, 17, and 18, so we set two autostep windows for those areas:
(cuda-gdb) autostep 11 for 2 lines
Breakpoint 1 at 0x796d18: file example.cu, line 11
Created autostep of length 2 lines
(cuda-gdb) autostep 16 for 3 lines
Breakpoint 2 at 0x796e90: file example.cu, line 16
Created autostep of length 3 lines

3. Finally, we run the program again with these autosteps:
(cuda-gdb) run
The program being debugged has been started already. Start it from the beginning? (y or n) y
[Termination of CUDA Kernel 0 (example<<<(8,1,1),(64,1,1)>>>) on Device 0]
Starting program: /home/jitud/cudagdb_test/autostep_ex/example
[Thread debugging using libthread_db enabled]
[New Thread 0x7ffff5688700 (LWP 9089)]
[Context Create of context 0x617270 on Devi
Walking Through the Code ......... 37
Example 2: autostep ......... 41
  Source Code ......... 41
  Debugging With Autosteps ......... 42
Appendix A. Supported Platforms ......... 44
  Host Platform Requirements ......... 44
    Mac OS X ......... 44
    Linux ......... 44
  GPU Requirements ......... 45
Appendix B. Known Issues ......... 46
  Graphics Driver

INTRODUCTION

This document introduces CUDA-GDB, the NVIDIA CUDA debugger, and describes what is new in version 4.1.

What is CUDA-GDB?

CUDA-GDB is the NVIDIA tool for debugging CUDA applications running on Linux and Mac. CUDA-GDB is an extension to the x86-64 port of GDB, the GNU Project debugger. The tool provides developers with a mechanism for debugging CUDA applications running on actual hardware. This enables developers to debug applications without the potential variations introduced by simulation and emulation environments. CUDA-GDB runs on Linux and Mac OS X, 32-bit and 64-bit. CUDA-GDB is based on GDB 7.2 on both Linux and Mac OS X. S
(cuda-gdb) break 21
Breakpoint 3 at 0x18ac: file bitreverse.cu, line 21

4. Run the CUDA application, and it executes until it reaches the first breakpoint (main) set in step 3:
(cuda-gdb) run
Starting program: /Users/CUDA_User1/docs/bitreverse
Reading symbols for shared libraries
Breakpoint 1, main () at bitreverse.cu:25
25 void *d = NULL; int i;

5. At this point, commands can be entered to advance execution or to print the program state. For this walkthrough, let's continue until the device kernel is launched:
(cuda-gdb) continue
Continuing.
Reading symbols for shared libraries .. done
Reading symbols for shared libraries .. done
[Context Create of context 0x80f200 on Device 0]
[Launch of CUDA Kernel 0 (bitreverse<<<(1,1,1),(256,1,1)>>>) on Device 0]
Breakpoint 3 at 0x8667b8: file bitreverse.cu, line 21
[Switching focus to CUDA kernel 0, grid 1, block (0,0,0), thread (0,0,0), device 0, sm 0, warp 0, lane 0]

Breakpoint 2, bitreverse<<<(1,1,1),(256,1,1)>>> (data=0x110000) at bitreverse.cu:9
9 unsigned int *idata = (unsigned int *)data;

CUDA-GDB has detected that a CUDA device kernel has been reached. The debugger prints the current CUDA thread of focus.

6. Verify the CUDA thread of focus with the info cuda threads command, and switch between host thread and the CUDA threads:
(cuda-gdb) info cuda threads
Please read the related documentation for more information.

By default, CUDA-GDB will report any memory error. See the next section for a list of the memory errors. To increase the number of memory errors being reported and to increase the precision of the memory errors, CUDA memcheck must be turned on.

Increasing the Precision of Memory Errors With Autostep

Autostep is a command to increase the precision of CUDA exceptions to the exact lane and instruction, when they would not have been otherwise. Under normal execution, an exception may be reported several instructions after the exception occurred, or the exact thread where an exception occurred may not be known unless the exception is a lane error. However, the precise origin of the exception can be determined if the program is being single-stepped when the exception occurs. Single-stepping manually is a slow and tedious process: stepping takes much longer than normal execution, and the user has to single-step each warp individually.

Autostep aids the user by allowing them to specify sections of code where they suspect an exception could occur; these sections are automatically and transparently single-stepped while the program is running. The rest of the program is executed normally to minimize the slow-down caused by single-stepping. The precise origin of an exception will be reported if the exception occurs within t
The following are examples of context events displayed:
[Context Create of context 0xad2fe60 on Device 0]
[Context Pop of context 0xad2fe60 on Device 0]
[Context Destroy of context 0xad2fe60 on Device 0]

The following are examples of kernel events displayed:
[Launch of CUDA Kernel 1 (kernel3) on Device 0]
[Termination of CUDA Kernel 1 (kernel3) on Device 0]

Note: The kernel termination event is only displayed when a kernel is launched asynchronously, or when the debugger can safely assume that the kernel has terminated.

CHECKING MEMORY ERRORS

The CUDA memcheck feature detects global memory violations and misaligned global memory accesses. This feature is off by default and can be enabled using the following variable in CUDA-GDB before the application is run:
(cuda-gdb) set cuda memcheck on

Once CUDA memcheck is enabled, any detection of global memory violations and misaligned global memory accesses will be reported.

When CUDA memcheck is enabled, all the kernel launches are made blocking, as if the environment variable CUDA_LAUNCH_BLOCKING was set to 1. The host thread launching a kernel will therefore wait until the kernel has completed before proceeding. This may change the behavior of your application.

You can also run the CUDA memory checker as a standalone tool named CUDA-MEMCHECK. This tool is also part of the toolkit.
appropriately. For example:
export DISPLAY=:0.0

Limitations

While X is in non-interactive mode, interacting with the X session can cause your debugging session to stall or terminate.

CUDA-GDB EXTENSIONS

Command Naming Convention

The existing GDB commands are unchanged. Every new CUDA command or option is prefixed with the CUDA keyword. As much as possible, CUDA-GDB command names will be similar to the equivalent GDB commands used for debugging host code. For instance, the GDB commands to display the host threads and switch to host thread 1 are, respectively:
(cuda-gdb) info threads
(cuda-gdb) thread 1

To display the CUDA threads and switch to CUDA thread 1, the user only has to type:
(cuda-gdb) info cuda threads
(cuda-gdb) cuda thread 1

Getting Help

As with GDB commands, the built-in help for the CUDA commands is accessible from the cuda-gdb command line by using the help command:
(cuda-gdb) help cuda name_of_the_cuda_command
(cuda-gdb) help set cuda name_of_the_cuda_option
(cuda-gdb) help info cuda name_of_the_info_cuda_command

Initialization File

The initialization file for CUDA-GDB is named .cuda-gdbinit and follows the same rules as the standard .gdbinit file used by GDB. The initialization file may contain any CUDA-GDB command. Those commands will be processed in order
if no GPUs are found.
(cuda-gdb) info cuda devices
  Dev  Description  SM Type  SMs  Warps/SM  Lanes/Warp  Max Regs/Lane  Active SMs Mask
*  0   gt200        sm_13    24   32        32          128            0x00ffffff

info cuda sms

This command shows all the SMs for the device and the associated active warps on the SMs. This command supports filters, and the default is "device current sm all". A "*" indicates the SM in focus. The results are grouped per device:
(cuda-gdb) info cuda sms
  SM  Active Warps Mask
Device 0
*  0  0x…
   1  0x…
   …

info cuda warps

This command takes you one level deeper and prints all the warps information for the SM in focus. This command supports filters, and the default is "device current sm current warp all". The command can be used to display which warp executes what block:
(cuda-gdb) info cuda warps
  Wp  Active Lanes Mask  Divergent Lanes Mask  Active Physical PC  Kernel  BlockIdx
Device 0 SM 0
*  0  0xffffffff         0x00000000            0x000000000000001c  0       (0,0,0)
   1  0xffffffff         0x00000
simultaneously as long as they remain coherent.

Another software coordinate is sometimes used: the grid. The difference between a grid and a kernel is the scope. The grid ID is unique per GPU, whereas the kernel ID is unique across all GPUs. Therefore, there is a one-to-one mapping between a kernel and a (grid, device) tuple.

Current Focus

To inspect the current focus, use the cuda command followed by the coordinates of interest:
(cuda-gdb) cuda device sm warp lane block thread
block (0,0,0), thread (0,0,0), device 0, sm 0, warp 0, lane 0
(cuda-gdb) cuda kernel block thread
kernel 1, block (0,0,0), thread (0,0,0)
(cuda-gdb) cuda kernel
kernel 1

Switching Focus

To switch the current focus, use the cuda command followed by the coordinates to be changed:
(cuda-gdb) cuda device 0 sm 1 warp 2 lane 3
[Switching focus to CUDA kernel 1, grid 2, block (8,0,0), thread (67,0,0), device 0, sm 1, warp 2, lane 3]
374 int totalThreads = gridDim.x * blockDim.x;

If the specified focus is not fully defined by the command, the debugger will assume that the omitted coordinates are set to the coordinates in the current focus, including the subcoordinates of the block and thread:
(cuda-gdb) cuda thread (15)
[Switching focus to CUDA kernel 1, grid 2, block (8,0,0), thread (15,0,0), device 0, sm 1, warp 0, lane 15]
374 int totalThreads = gridDim.x * blockDim.x;

The parentheses f
when CUDA-GDB is launched.

GUI Integration

Emacs

CUDA-GDB works with GUD in Emacs and XEmacs. No extra step is required other than pointing to the right binary. To use CUDA-GDB, the gud-gdb-command-name variable must be set to "cuda-gdb --annotate=3". Use M-x customize-variable to set the variable. Ensure that cuda-gdb is present in the Emacs/XEmacs $PATH.

DDD

CUDA-GDB works with DDD. To use DDD with CUDA-GDB, launch DDD with the following command:
ddd --debugger cuda-gdb
cuda-gdb must be in your $PATH.

KERNEL FOCUS

A CUDA application may be running several host threads and many device threads. To simplify the visualization of information about the state of the application, commands are applied to the entity in focus. When the focus is set to a host thread, the commands will apply only to that host thread (unless the application is fully resumed, for instance). On the device side, the focus is always set to the lowest granularity level: the device thread.

Software Coordinates vs. Hardware Coordinates

A device thread belongs to a block, which in turn belongs to a kernel. Thread, block, and kernel are the software coordinates of the focus. A device thread runs on a lane. A lane belongs to a warp, which belongs to an SM, which in turn belongs to a device. Lane, warp, SM, and device are the hardware coordinates of the focus. Software and hardware coordinates can be used interchangeably and
000  0x0000000000000000  0  (0,0,0)
   2  0xffffffff         0x00000000            0x0000000000000000  0       (0,0,0)
   3  0xffffffff         0x00000000            0x0000000000000000  0       (0,0,0)
   4  0xffffffff         0x00000000            0x0000000000000000  0       (0,0,0)
   5  0xffffffff         0x00000000            0x0000000000000000  0       (0,0,0)
   6  0xffffffff         0x00000000            0x0000000000000000  0       (0,0,0)
   7  0xffffffff         0x00000000            0x0000000000000000  0       (0,0,0)

info cuda lanes

This command displays all the lanes (threads) for the warp in focus. This command supports filters, and the default is "device current sm current warp current lane all". In the example below, you can see that all the lanes are at the same physical PC. The command can be used to display which lane executes what thread:
(cuda-gdb) info cuda lanes
  Ln  State   Physical PC         ThreadIdx
Device 0 SM 0 Warp 0
*  0  active  0x000000000000008c  (0,0,0)
   1  active  0x000000000000008c  (1,0,0)
   2  active  0x000000000000008c  (2,0,0)
   3  active  0x000000000000008c  (3,0,0)
   4  active  0x000000000000008c  (4,0,0)
   5  active  0x000000000000008c  (5,0,0)
   6  active  0x000000000000008c  (6,0,0)
   7  active  0x000000000000008c  (7,0,0)
   8  active  0x000000000000008c  (8,0,0)
   9  active  0x000000000000008c  (9,0,0)
  10  active  0x000000000000008c  (10,0,0)
  11  active  0x000000000000008c  (11,0,0)
  12  active  0x000000000000008c  (12,0,0)
  13  active  0x000000000000008c  (13,0,0)
  14  active  0x000000000000008c  (14,0,0)
  15  active  0x
000000000000008c  (15,0,0)
  16  active  0x000000000000008c  (16,0,0)

info cuda kernels

This command displays all the active kernels on the GPU in focus. It prints the SM mask, kernel ID, and grid ID for each kernel, with the associated dimensions and arguments. The kernel ID is unique across all GPUs, whereas the grid ID is unique per GPU. This command supports filters, and the default is "kernel all".
(cuda-gdb) info cuda kernels
  Kernel  Dev  Grid  SMs Mask  GridDim  BlockDim  Name       Args
*  1      0    2     0x…       …        …         acos_main  …

info cuda blocks

This command displays all the active (or running) blocks for the kernel in focus. The results are grouped per kernel. This command supports filters, and the default is "kernel current block all". The outputs are coalesced by default:
(cuda-gdb) info cuda blocks
  BlockIdx  To BlockIdx  Count  State
Kernel 1
*  (0,0,0)  (191,0,0)    192    running

Coalescing can be turned off as follows, in which case more information on the Device and the SM gets displayed:
(cuda-gdb) set cuda coalescing off

The following is the output of the same command when coalescing is turned off:
(cuda-gdb) info cuda blocks
  BlockIdx  State    Dev  SM
Kernel 1
*  (0,0,0)  running  0    0
   (1,0,0)  running  0    3
   (2,0,0)  running  0    6
   (3,0,0)  running  0    9
   (4,0,0)  running  0    12
   (5,0,0)  running  0    15
   (6,0,0)  ru
Example 2: autostep

This section shows how to use the autostep command and demonstrates how it helps increase the precision of memory error reporting.

Source Code

 1  #define NUM_BLOCKS 8
 2  #define THREADS_PER_BLOCK 64
 3
 4  __global__ void example(int **data) {
 5    int value1, value2, value3, value4, value5;
 6    int idx1, idx2, idx3;
 7
 8    idx1 = blockIdx.x * blockDim.x;
 9    idx2 = threadIdx.x;
10    idx3 = idx1 + idx2;
11    value1 = *(data[idx1]);
12    value2 = *(data[idx2]);
13    value3 = value1 + value2;
14    value4 = value1 * value2;
15    value5 = value3 + value4;
16    *(data[idx3]) = value5;
17    *(data[idx1]) = value3;
18    *(data[idx2]) = value4;
19    idx1 = idx2 = idx3 = 0;
20  }
21
22  int main(int argc, char *argv[]) {
23    int *host_data[NUM_BLOCKS * THREADS_PER_BLOCK];
24    int **dev_data;
25    const int zero = 0;
26
27    /* Allocate an integer for each thread in each block */
28    for (int block = 0; block < NUM_BLOCKS; block++) {
29      for (int thread = 0; thread < THREADS_PER_BLOCK; thread++) {
30        int idx = thread + block * THREADS_PER_BLOCK;
31        cudaMalloc(&host_data[idx], sizeof(int));
32        cudaMemcpy(host_data[idx], &zero, sizeof(int), cudaMemcpyHostToDevice);
33      }
34    }
35
36    /* This inserts an error into block 3, thread 39 */
37    host_dat
88c  0  0  0  7  acos.cu  376
  (0,0,0)  (8,0,0)  0x000000000088…88c  0  0  0  8  acos.cu  376
  (0,0,0)  (9,0,0)  0x000000000088…88c  0  0  0  9  acos.cu  376

Note: In coalesced form, threads must be contiguous in order to be coalesced. If some threads are not currently running on the hardware, they will create holes in the thread ranges. For instance, if a kernel consists of 2 blocks of 16 threads, and only the 8 lowest threads are active, then 2 coalesced ranges will be printed: one range for block 0 thread 0 to 7, and one range for block 1 thread 0 to 7. Because threads 8-15 in block 0 are not running, the 2 ranges cannot be coalesced.

CONTEXT AND KERNEL EVENTS

Within CUDA-GDB, kernel refers to your device code that executes on the GPU, while context refers to the virtual address space on the GPU for your kernel. You can turn ON or OFF the display of CUDA context and kernel events to review the flow of the active contexts and kernels.

Display CUDA context events:
(cuda-gdb) set cuda context_events 1
Display CUDA context events.
(cuda-gdb) set cuda context_events 0
Do not display CUDA context events.

Display CUDA kernel events:
(cuda-gdb) set cuda kernel_events 1
Display CUDA kernel events.
(cuda-gdb) set cuda kernel_events 0
Do not display CUDA kernel events.

Examples of displayed events
NVIDIA CUDA-GDB
NVIDIA CUDA Debugger
4.1 Release
for Linux and Mac

TABLE OF CONTENTS

1. Introduction ......... 1
   What is CUDA-GDB? ......... 1
   Supported Features ......... 1
   About This Document ......... 2
2. Release Notes ......... 3
   GDB 7.2 Source Base ......... 3
   Support for Simultaneous CUDA-GDB Sessions ......... 3
   New Autostep Command ......... 4
   Support for Multiple Contexts ......... 4
   Support for Device Assertions ......... 4
3. Getting Started ......... 5
   Installation Instructions ......... 5
   Setting Up the Debugger Environment ......... 6
      Linux ......... 6
      Mac OS X ......... 6
   Compiling the Application ......... 7
      Debug Compilation ......... 7
      Compiling for Fermi GPUs ......... 
DA compiler driver, provides a mechanism for generating the debugging information necessary for CUDA-GDB to work properly. The -g -G option pair must be passed to NVCC when an application is compiled in order to debug with CUDA-GDB; for example:
nvcc -g -G foo.cu -o foo

Using this line to compile the CUDA application foo.cu:
> forces -O0 compilation, with the exception of very limited dead-code eliminations and register-spilling optimizations.
> makes the compiler include debug information in the executable.

Compiling for Fermi GPUs

For Fermi GPUs, add the following flags to target Fermi output when compiling the application:
-gencode arch=compute_20,code=sm_20

It will compile the kernels specifically for the Fermi architecture once and for all. If the flag is not specified, then the kernels must be recompiled at runtime every time.

Compiling for Fermi and Tesla GPUs

If you are targeting both Fermi and Tesla GPUs, include these two flags:
-gencode arch=compute_20,code=sm_20 -gencode arch=compute_10,code=sm_10

Note: It is highly recommended to use the -gencode flag whenever possible.

Using the Debugger

Debugging a CUDA GPU involves pausing that GPU. When the graphics desktop manager is running on the same GPU, debugging that GPU freezes the GUI and makes the desktop unusable. To avoid this, use CUDA-GDB in the following system configurations:
ERWISE WITH RESPECT TO THE MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE.

Information furnished is believed to be accurate and reliable. However, NVIDIA Corporation assumes no responsibility for the consequences of use of such information or for any infringement of patents or other rights of third parties that may result from its use. No license is granted by implication or otherwise under any patent rights of NVIDIA Corporation. Specifications mentioned in this publication are subject to change without notice. This publication supersedes and replaces all other information previously supplied. NVIDIA Corporation products are not authorized as critical components in life support devices or systems without express written approval of NVIDIA Corporation.

Trademarks

NVIDIA, the NVIDIA logo, NVIDIA nForce, GeForce, NVIDIA Quadro, NVDVD, NVIDIA Personal Cinema, NVIDIA Soundstorm, Vanta, TNT2, TNT, RIVA, RIVA TNT, VOODOO, VOODOO GRAPHICS, WAVEBAY, Accuview Antialiasing, Detonator, Digital Vibrance Control, ForceWare, NVRotate, NVSensor, NVSync, PowerMizer, Quincunx Antialiasing, Sceneshare, See What You've Been Missing, StreamThru, SuperStability, T-BUFFER, The Way It's Meant to be Played Logo, TwinBank, TwinView, and the Video & Nth Superscript Design Logo are registered trademarks or trademarks of NVIDIA Corporation in the United States and/or other countries. Other comp
Illegal Instruction): a thread within a warp has executed an illegal instruction.

CUDA_EXCEPTION_5 (Warp Out-of-range Address): Not precise; warp error. This occurs when any thread within a warp accesses an address that is outside the valid range of local or shared memory regions.

CUDA_EXCEPTION_6 (Warp Misaligned Address): Not precise; warp error. This occurs when any thread within a warp accesses an address in the local or shared memory segments that is not correctly aligned.

CUDA_EXCEPTION_7 (Warp Invalid Address Space): Not precise; warp error. This occurs when any thread within a warp executes an instruction that accesses a memory space not permitted for that instruction.

CUDA_EXCEPTION_8 (Warp Invalid PC): Not precise; warp error. This occurs when any thread within a warp advances its PC beyond the 40-bit address space.

CUDA_EXCEPTION_9 (Warp Hardware Stack Overflow): Not precise; warp error. This occurs when any thread in a warp triggers a hardware stack overflow. This should be a rare occurrence.

CUDA_EXCEPTION_10 (Device Illegal Address): Not precise; global error. This occurs when a thread accesses an illegal (out-of-bounds) global address. For increased precision, use the cuda memcheck feature.

Table 10.1 CUDA Exception Codes (continued)

CUDA_EXCEPTION_11 (Lane Misaligned Address): Precise
(requires memcheck on); per-lane (thread) error. This occurs when a thread accesses a global address that is not correctly aligned.

CUDA_EXCEPTION_12 (Warp Assert): Precise; per-warp error. This occurs when any thread in the warp hits a device-side assertion.

WALK-THROUGH EXAMPLES

This chapter contains two CUDA-GDB walk-through examples:
> Example 1: bitreverse
> Example 2: autostep

Example 1: bitreverse

This section presents a walk-through of CUDA-GDB by debugging a sample application, called bitreverse, that performs a simple 8-bit reversal on a data set.

Source Code

 1  #include <stdio.h>
 2  #include <stdlib.h>
 3
 4  /* Simple 8-bit bit reversal Compute test */
 5
 6  #define N 256
 7
 8  __global__ void bitreverse(void *data) {
 9    unsigned int *idata = (unsigned int *)data;
10    extern __shared__ int array[];
11
12    array[threadIdx.x] = idata[threadIdx.x];
13
14    array[threadIdx.x] = ((0xf0f0f0f0 & array[threadIdx.x]) >> 4) |
15                         ((0x0f0f0f0f & array[threadIdx.x]) << 4);
16    array[threadIdx.x] = ((0xcccccccc & array[threadIdx.x]) >> 2) |
17                         ((0x33333333 & array[threadIdx.x]) << 2);
18    array[threadIdx.x] = ((0xaaaaaaaa & array[threadIdx.x]) >> 1) |
19                         ((0x55555555 & array[threadIdx.x]) << 1);
20
21    idata[threadIdx.x] = array[thre
Info CUDA Commands

These are commands that display information about the GPU and the application's CUDA state. The available options are:
> devices: information about all the devices
> sms: information about all the SMs in the current device
> warps: information about all the warps in the current SM
> lanes: information about all the lanes in the current warp
> kernels: information about all the active kernels
> blocks: information about all the active blocks in the current kernel
> threads: information about all the active threads in the current kernel

A filter can be applied to every info cuda command. The filter restricts the scope of the command. A filter is composed of one or more restrictions. A restriction can be any of the following:
> device n
> sm n
> warp n
> lane n
> kernel n
> grid n
> block x,y or block (x,y)
> thread x,y,z or thread (x,y,z)
where n, x, y, z are integers, or one of the following special keywords: current, any, and all. current indicates that the corresponding value in the current focus should be used; any and all indicate that any value is acceptable.

info cuda devices

This command enumerates all the GPUs in the system, sorted by device index. A "*" indicates the device currently in focus. This command supports filters. The default is "device all". This command prints "No CUDA Devices"
a[3 * THREADS_PER_BLOCK + 39] = NULL;
38
39    /* Copy the array of pointers to the device */
40    cudaMalloc((void**)&dev_data, sizeof(host_data));
41    cudaMemcpy(dev_data, host_data, sizeof(host_data), cudaMemcpyHostToDevice);
42
43    /* Execute example */
44    example<<<NUM_BLOCKS, THREADS_PER_BLOCK>>>(dev_data);
45    cudaThreadSynchronize();
46  }
47

In this small example, we have an array of pointers to integers, and we want to do some operations on the integers. Suppose, however, that one of the pointers is NULL, as shown in line 37. This will cause CUDA_EXCEPTION_10 "Device Illegal Address" to be thrown when we try to access the integer that corresponds with block 3, thread 39. This exception should occur at line 16 when we try to write to that value.

Debugging With Autosteps

1. Compile the example and start CUDA-GDB as normal. We begin by running the program:
(cuda-gdb) run
Starting program: /home/jitud/cudagdb_test/autostep_ex/example
[Thread debugging using libthread_db enabled]
[New Thread 0x7ffff5688700 (LWP 9083)]
[Context Create of context 0x617270 on Device 0]
[Launch of CUDA Kernel 0 (example<<<(8,1,1),(64,1,1)>>>) on Device 0]

Program received signal CUDA_EXCEPTION_10, Device Illegal Address.
[Switching focus to CUDA kernel 0, grid 1, block (1,0,0), thread (0,0,0),
...threadIdx.x;

24  int main(void) {
25      void *d = NULL; int i;
26      unsigned int idata[N], odata[N];
27
28      for (i = 0; i < N; i++)
29          idata[i] = (unsigned int)i;
30
31      cudaMalloc((void**)&d, sizeof(int)*N);
32      cudaMemcpy(d, idata, sizeof(int)*N,
33                 cudaMemcpyHostToDevice);
34
35      bitreverse<<<1, N, N*sizeof(int)>>>(d);
36
37      cudaMemcpy(odata, d, sizeof(int)*N,
38                 cudaMemcpyDeviceToHost);
39
40      for (i = 0; i < N; i++)
41          printf("%u -> %u\n", idata[i], odata[i]);
42
43      cudaFree((void*)d);
44      return 0;
45  }

Walking Through the Code

1. Begin by compiling the bitreverse.cu CUDA application for debugging by entering the following command at a shell prompt:

   nvcc -g -G bitreverse.cu -o bitreverse

   This command assumes that the source file name is bitreverse.cu and that no additional compiler flags are required for compilation. See also "Compiling for Debugging" on page 20.

2. Start the CUDA debugger by entering the following command at a shell prompt:

   cuda-gdb bitreverse

3. Set breakpoints. Set both the host (main) and GPU (bitreverse) breakpoints here. Also, set a breakpoint at a particular line in the device function (bitreverse.cu:18):

   (cuda-gdb) break main
   Breakpoint 1 at 0x18e1: file bitreverse.cu, line 25.
   (cuda-gdb) break bitreverse
   Breakpoint 2 at 0x18a1: file bitreverse.cu, line 8.
Installation Instructions

Follow these steps to install CUDA-GDB.

1. Visit the NVIDIA CUDA Zone download page: http://www.nvidia.com/object/cuda_get.html
2. Select the appropriate operating system, Mac OS X or Linux. (See "Host Platform Requirements" on page 26.)
3. Download and install the CUDA Driver.
4. Download and install the CUDA Toolkit.

Setting Up the Debugger Environment

Linux

Set up the PATH and LD_LIBRARY_PATH environment variables:

export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/local/cuda/lib:$LD_LIBRARY_PATH

Mac OS X

Set up the PATH and DYLD_LIBRARY_PATH environment variables:

export PATH=/usr/local/cuda/bin:$PATH
export DYLD_LIBRARY_PATH=/usr/local/cuda/lib:$DYLD_LIBRARY_PATH

Also, if you are unable to execute CUDA-GDB, or if you hit the "Unable to find Mach task port for processid" error, try resetting the correct permissions with the following commands:

sudo chgrp procmod /usr/local/cuda/bin/cuda-binary-gdb
sudo chmod 2755 /usr/local/cuda/bin/cuda-binary-gdb
sudo chmod 755 /usr/local/cuda/bin/cuda-gdb

Temporary Directory

By default, CUDA-GDB uses /tmp as the directory to store temporary files. To select a different directory, set the TMPDIR environment variable.

Compiling the Application

Debug Compilation

NVCC, the NVIDIA CUDA compiler driver, ...
Other company and product names may be trademarks or registered trademarks of the respective owners with which they are associated.

Copyright © 2007-2012 NVIDIA Corporation. All rights reserved.

www.nvidia.com
[Figure 3.1: deviceQuery output. The full capability listing for the two CUDA devices in the example system, a Quadro FX 4800 (24 MP x 8 cores/MP = 192 cores) and a GeForce 8800 GT (14 MP x 8 cores/MP = 112 cores), each reporting CUDA Driver/Runtime version 3.20 with its memory sizes, thread and grid dimension limits, clock rate, and compute mode.]

Remote Debugging

To remotely debug an application, use SSH or VNC from the host system to connect to the target system. From there, CUDA-GDB can be launched in console mode.

Multiple Debuggers

In a multi-GPU environment, several debugging sessions may...
...ce 0]
[Launch of CUDA Kernel 1 (example<<<(8,1,1),(64,1,1)>>>) on Device 0]
[Switching focus to CUDA kernel 1, grid 2, block (0,0,0), thread (0,0,0), device 0, sm 0, warp 0, lane 0]

Program received signal CUDA_EXCEPTION_10, Device Illegal Address.
[Current focus set to CUDA kernel 1, grid 2, block (3,0,0), thread (32,0,0), device 0, sm 1, warp 3, lane 0]
Autostep precisely caught exception at example.cu:16 (0x796e90)

This time we correctly caught the exception at line 16. Even though CUDA_EXCEPTION_10 is a global error, we have now narrowed it down to a warp error, so we now know that the thread that threw the exception must have been in the same warp as block 3, thread 32.

In this example, we have narrowed down the scope of the error from 512 threads down to 32 threads just by setting two autosteps and re-running the program.

SUPPORTED PLATFORMS

The general platform and GPU requirements for running NVIDIA CUDA-GDB are described in this section.

Host Platform Requirements

Mac OS

CUDA-GDB is supported on both 32-bit and 64-bit editions of the following Mac OS versions:

- Mac OS X 10.6
- Mac OS X 10.7

Linux

CUDA-GDB is supported on both 32-bit and 64-bit editions of the following Linux distributions:

- Red Hat Enterprise Linux 4.8 (64-bit only)
- Red Hat Enterprise Linux 5.5, 5.6, and 5.7
- Red Hat Enterprise Linux 6.0
However, the exact instruction that generates the exception can be determined if the program is being single-stepped when the exception occurs. Manually single-stepping through a program is a slow and tedious process. Therefore, autostep aids the user by allowing them to specify sections of code where they suspect an exception could occur. These sections are automatically single-stepped through when the program is running, and any exception that occurs within these sections is precisely reported. Type "help autostep" from CUDA-GDB for the syntax and usage of the command.

Support for Multiple Contexts

On GPUs with compute capability of SM20 or higher, debugging multiple contexts on the same GPU is now supported. It was a known limitation in previous releases.

Support for Device Assertions

The R285 driver, released with the 4.1 version of the toolkit, supports device assertions. CUDA-GDB supports the assertion call and stops the execution of the application when the assertion is hit. Then the variables and memory can be inspected as usual. The application can also be resumed past the assertion if needed. Use the "set cuda hide_internal_frames" option to expose or hide the system call frames (hidden by default).

GETTING STARTED

Included in this chapter are instructions for installing CUDA-GDB and for using NVCC, the NVIDIA CUDA compiler driver, to compile CUDA programs for debugging.
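A device assertion of the kind described above could look like the following sketch (the kernel and its names are hypothetical, not from the manual); when the predicate fails, CUDA-GDB halts the kernel at the assert call with all variables inspectable:

```cuda
#include <assert.h>

// Hypothetical kernel illustrating a device-side assertion: with the R285
// driver, CUDA-GDB stops execution here if the predicate is false, and the
// application can be resumed past the assertion afterwards.
__global__ void checkedScale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        assert(data[i] >= 0.0f);   // debugger halts the kernel if this fails
        data[i] *= 2.0f;
    }
}
```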
...lead to application termination or indeterminate results.

Table 10.1 CUDA Exception Codes (columns: exception code; precision of the error; scope of the error; description)

- CUDA_EXCEPTION_0 "Device Unknown Exception" (not precise; global error on the GPU): This is a global GPU error caused by the application which does not match any of the listed error codes below. This should be a rare occurrence. Potentially, this may be due to Device Hardware Stack overflows or a kernel generating an exception very close to its termination.
- CUDA_EXCEPTION_1 "Lane Illegal Address" (precise, requires memcheck on; per-lane/thread error): This occurs when a thread accesses an illegal (out of bounds) global address.
- CUDA_EXCEPTION_2 "Lane User Stack Overflow" (precise; per-lane/thread error): This occurs when a thread exceeds its stack memory limit.
- CUDA_EXCEPTION_3 "Device Hardware Stack Overflow" (not precise; global error on the GPU): This occurs when the application triggers a global hardware stack overflow. The main cause of this error is large amounts of divergence in the presence of function calls.

Table 10.1 CUDA Exception Codes (continued)

- CUDA_EXCEPTION_4 (not precise; warp error): This occurs when any ...
... 7
Compiling for Fermi and Tesla GPUs .... 7
Using the Debugger .... 8
Single GPU Debugging .... 8
Multi-GPU Debugging .... 8
Remote Debugging .... 10
Multiple Debuggers .... 11
CUDA/OpenGL Interop Applications on Linux .... 11
4. CUDA-GDB Extensions .... 12
Command Naming Convention .... 12
Getting Help .... 12
Initialization Files .... 13
GUI Integration .... 13
Emacs .... 13
DDD .... 13
5. Kernel Focus .... 14
Software Coordinates vs Hardware Coordinates .... 14
Current Focus .... 14
Switching Focus ....
...evaluated to FALSE, the debugger may appear to be hanging or stuck, although it is not the case. You can interrupt the application with CTRL-C to verify that progress is being made.

Conditional breakpoints can only be set on code from CUDA modules that are already loaded. Otherwise, CUDA-GDB will report an error that it is unable to find symbols in the current context. If unsure, first set an unconditional breakpoint at the desired location, and add the conditional statement the first time the breakpoint is hit by using the cond command.

INSPECTING PROGRAM STATE

Memory and Variables

The GDB print command has been extended to decipher the location of any program variable, and can be used to display the contents of any CUDA program variable, including:

- data allocated via cudaMalloc()
- data that resides in various GPU memory regions, such as shared, local, and global memory
- special CUDA runtime variables, such as threadIdx

Variable Storage and Accessibility

Depending on the variable type and usage, variables can be stored either in registers or in local, shared, const, or global memory. You can print the address of any variable to find out where it is stored and directly access the associated memory.

The example below shows how the variable array, which is of type @shared int*, can be directly accessed in order to see what the stored values are in the array:

(cuda-gdb) print &ar...
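For context, a minimal kernel sketch (ours, not from the manual) that would place such an array in shared memory, so that the print session that follows reports a @shared address:

```cuda
// Hypothetical kernel: `array` is declared __shared__, so CUDA-GDB reports
// its address in the @shared address space rather than as a host pointer.
__global__ void sumFour(const int *in, int *out) {
    __shared__ int array[4];
    array[threadIdx.x] = in[threadIdx.x];   // fill shared memory per thread
    __syncthreads();
    if (threadIdx.x == 0)
        *out = array[0] + array[1] + array[2] + array[3];
}
```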
...these sections. Thus, the exact instruction and thread where an exception occurred can be found quickly and with much less effort by using autostep.

Usage

autostep [LOCATION]
autostep [LOCATION] for LENGTH [lines|instructions]

- LOCATION may be anything that you use to specify the location of a breakpoint, such as a line number, function name, or an instruction address preceded by an asterisk. If no LOCATION is specified, then the current instruction address is used.
- LENGTH specifies the size of the autostep window in number of lines or instructions. "lines" and "instructions" can be shortened, e.g., "l" or "i". If the length type is not specified, then "lines" is the default. If the "for" clause is omitted, then the default is 1 line.
- astep can be used as an alias for the autostep command.
- Calls to functions made during an autostep will be stepped over.
- In case of divergence, the length of the autostep window is determined by the number of lines or instructions the first active lane in each warp executes. Divergent lanes are also single-stepped, but the instructions they execute do not count towards the length of the autostep window.
- If a breakpoint occurs while inside an autostep window, the warp where the breakpoint was hit will not continue autostepping when the program is resumed. However, other warps may continue autostepping.
- Overlapping autosteps are not supported.
...option to "application":

(cuda-gdb) set cuda break_on_launch application

Possible options are:

- application: any kernel launched by the user application
- system: any kernel launched by the driver, such as memset
- all: any kernel, application and system
- none: no kernel, application or system

Those automatic breakpoints are not displayed by the info breakpoints command and are managed separately from individual breakpoints. Turning off the option will not delete other individual breakpoints set to the same address, and vice versa.

Conditional Breakpoints

To make the breakpoint conditional, use the optional "if" keyword or the cond command:

(cuda-gdb) break foo.cu:23 if threadIdx.x == 1 && i < 5
(cuda-gdb) cond 3 threadIdx.x == 1 && i < 5

Conditional expressions may refer to any variable, including built-in variables such as threadIdx and blockIdx. Function calls are not allowed in conditional expressions.

Note that conditional breakpoints are always hit and evaluated, but the debugger reports the breakpoint as being hit only if the conditional statement is evaluated to TRUE. The process of hitting the breakpoint and evaluating the corresponding conditional statement is time-consuming. Therefore, running applications while using conditional breakpoints may slow down the debugging session. Moreover, if the conditional statement is always ...
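For reference, a kernel shape on which the conditional breakpoint above would make sense (this foo.cu is hypothetical, not the manual's): the debugger would stop at the marked line only for thread 1 during the first five iterations.

```cuda
// Hypothetical foo.cu: with "break foo.cu:23 if threadIdx.x == 1 && i < 5",
// the breakpoint is reported as hit only while threadIdx.x == 1 and i < 5,
// even though every resident thread stops and evaluates the condition.
__global__ void foo(int *out, int n) {
    int acc = 0;
    for (int i = 0; i < n; i++) {
        acc += i;               // imagine this is line 23
    }
    out[threadIdx.x] = acc;
}
```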
...sm_20, if you are debugging code in device functions that get called by multiple kernels, then setting a breakpoint in the device function will insert the breakpoint in only one of the kernels.

- In a multi-GPU debugging environment on Mac OS X with Aqua running, you may experience some visible delay while single-stepping the application.
- Setting a breakpoint on a line within a __device__ or __global__ function before its module is loaded may result in the breakpoint being temporarily set on the first line of a function below in the source code. As soon as the module for the targeted function is loaded, the breakpoint will be reset properly. In the meantime, the breakpoint may be hit, depending on the application. In those situations, the breakpoint can be safely ignored, and the application can be resumed.
- The scheduler-locking option cannot be set to on.
- Stepping again after stepping out of a kernel results in undetermined behavior. It is recommended to use the continue command instead.
- OpenGL applications may require to launch X in non-interactive mode. See "CUDA/OpenGL Interop Applications on Linux" on page 11 for details.

Notice

ALL NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS, DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY, "MATERIALS") ARE BEING PROVIDED "AS IS." NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTH...
... 15
6. Program Execution .... 16
Interrupting the Application .... 16
Single Stepping .... 16
7. Breakpoints .... 18
Symbolic Breakpoints .... 18
Line Breakpoints .... 19
Address Breakpoints .... 19
Kernel Entry Breakpoints .... 19
Conditional Breakpoints .... 20
8. Inspecting Program State .... 21
Memory and Variables .... 21
Variable Storage and Accessibility .... 21
Inspecting Textures .... 22
Info CUDA Commands .... 23
info cuda devices .... 23
info cuda sms .... 24
info cuda warps ....
... 24
info cuda lanes .... 25
info cuda kernels .... 25
info cuda blocks .... 26
info cuda threads .... 27
9. Context and Kernel Events .... 28
Display CUDA context events .... 28
Display CUDA kernel events .... 28
Examples of displayed events .... 29
10. Checking Memory Errors .... 30
Checking Memory Errors .... 30
Increasing the Precision of Memory Errors With Autostep .... 31
Usage .... 31
Related Commands .... 32
GPU Error Reporting .... 33
11. Walk-through Examples .... 36
Example 1: bitreverse .... 36
Source Code .... 36
...

info cuda threads

This command displays the application's active CUDA blocks and threads, with the total count of threads in those blocks. Also displayed are the virtual PC and the associated source file and line number information. The results are grouped per kernel. The command supports filters, with the default being "kernel current block all thread all". The outputs are coalesced by default, as follows:

(cuda-gdb) info cuda threads
  BlockIdx ThreadIdx To BlockIdx ThreadIdx Count         Virtual PC Filename  Line
Device 0 SM 0
*  (0,0,0)   (0,0,0)     (0,0,0)  (31,0,0)    32 0x000000000088f88c  acos.cu   376
   (0,0,0)  (32,0,0)   (191,0,0) (127,0,0) 24544 0x000000000088f800  acos.cu   374

Coalescing can be turned off as follows, in which case more information is displayed with the output:

(cuda-gdb) info cuda threads
  BlockIdx ThreadIdx         Virtual PC Dev SM Wp Ln Filename Line
Kernel 1
*  (0,0,0)   (0,0,0) 0x000000000088f88c   0  0  0  0 acos.cu   376
   (0,0,0)   (1,0,0) 0x000000000088f88c   0  0  0  1 acos.cu   376
   (0,0,0)   (2,0,0) 0x000000000088f88c   0  0  0  2 acos.cu   376
   (0,0,0)   (3,0,0) 0x000000000088f88c   0  0  0  3 acos.cu   376
   (0,0,0)   (4,0,0) 0x000000000088f88c   0  0  0  4 acos.cu   376
   (0,0,0)   (5,0,0) 0x000000000088f88c   0  0  0  5 acos.cu   376
   (0,0,0)   (6,0,0) 0x000000000088f88c   0  0  0  6 acos.cu   376
   (0,0,0)   (7,0,0) 0x000000000088f88c   0  0  0  7 acos.cu   376
...
...at the same time. Therefore, the same breakpoint may be hit several times, and the user must be careful with checking which thread(s) actually hit(s) the breakpoint.

Symbolic Breakpoints

To set a breakpoint at the entry of a function, use the break command followed by the name of the function or method:

(cuda-gdb) break my_function
(cuda-gdb) break my_class::my_method

For templatized functions and methods, the full signature must be given:

(cuda-gdb) break int my_templatized_function<int>(int)

The mangled name of the function can also be used. To find the mangled name of a function, you can use the following command:

(cuda-gdb) set demangle-style none
(cuda-gdb) info function my_function_name
(cuda-gdb) set demangle-style auto

Line Breakpoints

To set a breakpoint on a specific line number, use the following syntax:

(cuda-gdb) break my_file.cu:185

If the specified line corresponds to an instruction within templatized code, multiple breakpoints will be created, one for each instance of the templatized code.

Address Breakpoints

To set a breakpoint at a specific address, use the break command with the address as argument:

(cuda-gdb) break *0x1afe34d0

The address can be any address on the device or the host.

Kernel Entry Breakpoints

To break on the first instruction of every launched kernel, set the break_on_launch option ...
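A sketch of the kind of templatized device code the line-breakpoint note refers to (the names are illustrative, not from the manual): a breakpoint on the shared source line is expanded into one breakpoint per instantiation.

```cuda
// Hypothetical templatized device function; breaking on the return line
// creates one breakpoint per instance (here the int and float versions).
template <typename T>
__device__ T my_templatized_function(T v) {
    return v + v;            // one breakpoint per instantiation of this line
}

__global__ void kernel_int(int *p)     { p[0] = my_templatized_function(p[0]); }
__global__ void kernel_float(float *p) { p[0] = my_templatized_function(p[0]); }
```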
...use the new CUDA commands that have been added to GDB. Some walk-through examples are also provided. It is assumed that the user already knows the basic GDB commands used to debug host applications.

RELEASE NOTES

The following features have been added for the 4.1 release:

GDB 7.2 Source Base

Until now, CUDA-GDB was based on GDB 6.6 on Linux and GDB 6.3.5 on Darwin (the Apple branch). Now, both versions of CUDA-GDB are using the same 7.2 source base. Also, CUDA-GDB supports newer versions of GCC (tested up to GCC 4.5), has better support for DWARF3 debug information, and better C++ debugging support.

Support for Simultaneous CUDA-GDB Sessions

With the 4.1 release, the single CUDA-GDB process restriction is lifted. Now, multiple CUDA-GDB sessions are allowed to co-exist, as long as the GPUs are not shared between the applications being debugged. For instance, one CUDA-GDB process can debug process foo using GPU 0 while another CUDA-GDB process debugs process bar using GPU 1. The exclusive use of GPUs can be enforced with the CUDA_VISIBLE_DEVICES environment variable.

New Autostep Command

A new autostep command was added. The command increases the precision of CUDA exceptions by automatically single-stepping through portions of code. Under normal execution, the thread and instruction where an exception occurred may be imprecisely reported.
...the parentheses for the block and thread arguments are optional:

(cuda-gdb) cuda block 1 thread 3
[Switching focus to CUDA kernel 1, block (1,0,0), thread (3,0,0), device 0, sm 3, warp 0, lane 3]
374        int totalThreads = gridDim.x * blockDim.x;

PROGRAM EXECUTION

Applications are launched the same way in CUDA-GDB as they are with GDB, by using the run command. This chapter describes how to interrupt and single-step CUDA applications.

Interrupting the Application

If the CUDA application appears to be hanging or stuck in an infinite loop, it is possible to manually interrupt the application by pressing CTRL-C. When the signal is received, the GPUs are suspended and the cuda-gdb prompt will appear. At that point, the program can be inspected, modified, single-stepped, resumed, or terminated at the user's discretion.

This feature is limited to applications running within the debugger. It is not possible to break into and debug applications that have been launched outside the debugger.

Single Stepping

Single-stepping device code is supported. However, unlike host code single-stepping, device code single-stepping works at the warp level. This means that single-stepping a device kernel advances all the active threads in the warp currently in focus. The divergent threads in the warp are not single-stepped.

In order to advance the execution of more than one warp, a breakpoint must be set at the desired location, and ...
...point will be resolved cannot be controlled. With the driver API, the user can control the instance to which the breakpoint will be resolved by setting the breakpoint right after its module is loaded.

Multi-GPU Debugging in Console Mode

CUDA-GDB allows simultaneous debugging of applications running CUDA kernels on multiple GPUs. In console mode, CUDA-GDB can be used to pause and debug every GPU in the system. You can enable console mode as described above for the single-GPU console mode.

Multi-GPU Debugging with the Desktop Manager Running

This can be achieved by running the desktop GUI on one GPU and CUDA on the other GPU to avoid hanging the desktop GUI.

On Linux

The CUDA driver automatically excludes the GPU used by X11 from being visible to the application being debugged. This can change the behavior of the application, since, if there are n GPUs in the system, then only n-1 GPUs will be visible to the application.

On Mac OS X

The CUDA driver exposes every CUDA-capable GPU in the system, including the one used by the Aqua desktop manager. To determine which GPU should be used for CUDA, run the deviceQuery app from the CUDA SDK samples. The output of deviceQuery, as shown in Figure 3.1, indicates all the GPUs in the system. For example, if you have two GPUs, you will see Device 0: "GeForce xxxx" and Device 1: "GeForce xxxx". Choose the Device <index> that is not rendering ...
...ray
$1 = (@shared int (*)[0]) 0x20
(cuda-gdb) print array[0]@4
$2 = {0, 128, 64, 192}

You can also access the shared memory indexed into the starting offset to see what the stored values are:

(cuda-gdb) print *(@shared int*)0x20
$3 = 0
(cuda-gdb) print *(@shared int*)0x24
$4 = 128
(cuda-gdb) print *(@shared int*)0x28
$5 = 64

The example below shows how to access the starting address of the input parameter to the kernel:

(cuda-gdb) print &data
$6 = (const @global void * const @parameter *) 0x10
(cuda-gdb) print *(const @global void * const @parameter *) 0x10
$7 = (const @global void * const @parameter) 0x110000

Inspecting Textures

Note: The debugger can always read/write the source variables when the PC is on the first assembly instruction of a source instruction. When doing assembly-level debugging, the value of source variables is not always accessible.

To inspect a texture, use the print command while dereferencing the texture recast to the type of the array it is bound to. For instance, if texture tex is bound to array A of type float*, use:

(cuda-gdb) print *(@texture float *)tex

All the array operators, such as [], can be applied to (@texture float *)tex:

(cuda-gdb) print ((@texture float *)tex)[2]
(cuda-gdb) print ((@texture float *)tex)[2]@4
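The casts above assume a texture reference bound to a linear float array. A minimal sketch of that setup (the names `tex`, `copyFromTex`, and the host calls are ours, shown only to ground the example):

```cuda
#include <cuda_runtime.h>

// Hypothetical setup: texture reference `tex` bound to a device float array,
// matching the "(@texture float *)tex" casts used in the debugger session.
texture<float, 1, cudaReadModeElementType> tex;

__global__ void copyFromTex(float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1Dfetch(tex, i);   // read element i through the texture
}

// Host side (error checking omitted):
//   cudaMalloc(&dA, n * sizeof(float));
//   cudaBindTexture(NULL, tex, dA, n * sizeof(float));
```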
...the desktop on your connected monitor. If Device 0 is rendering the desktop, then choose Device 1 for running and debugging the CUDA application. This exclusion of the desktop can be achieved by setting the CUDA_VISIBLE_DEVICES environment variable to 1:

export CUDA_VISIBLE_DEVICES=1

[Figure 3.1 (excerpt): deviceQuery output. "There are 2 devices supporting CUDA": Device 0 "Quadro FX 4800" and Device 1 "GeForce 8800 GT", each listed with CUDA Driver/Runtime version 3.20 and its full set of device attributes (global memory, multiprocessor and core counts, constant and shared memory, registers per block, warp size, thread and grid dimension limits, memory pitch, texture alignment, clock rate, and compute mode).]
...then the application must be fully resumed.

A special case is single-stepping over a thread barrier call: __syncthreads(). In this case, an implicit temporary breakpoint is set immediately after the barrier, and all threads are resumed until the temporary breakpoint is hit.

On GPUs with sm_type lower than sm_20, it is not possible to step over a subroutine in the device code. Instead, CUDA-GDB always steps into the device function. On GPUs with sm_type sm_20 and higher, you can step in, over, or out of the device functions, as long as they are not inlined. To force a function to not be inlined by the compiler, the __noinline__ keyword must be added to the function declaration.

BREAKPOINTS

There are multiple ways to set a breakpoint on a CUDA application. Those methods are described below. The commands to set a breakpoint on the device code are the same as the commands used to set a breakpoint on the host code.

If the breakpoint is set on device code, the breakpoint will be marked pending until the ELF image of the kernel is loaded. At that point, the breakpoint will be resolved and its address will be updated.

When a breakpoint is set, it forces all resident GPU threads to stop at this location when they hit the corresponding PC. When a breakpoint is hit by one thread, there is no guarantee that the other threads will hit the breakpoint ...
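As a sketch of the __noinline__ usage just described (the function names are illustrative, not from the manual):

```cuda
// Keeping addBias out-of-line lets CUDA-GDB step into it on sm_20 and higher;
// without __noinline__ the compiler may inline it, and there is then no call
// to step into.
__device__ __noinline__ int addBias(int v, int bias) {
    return v + bias;
}

__global__ void applyBias(int *data, int bias) {
    data[threadIdx.x] = addBias(data[threadIdx.x], bias);
}
```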
...supported features.

CUDA-GDB is designed to present the user with a seamless debugging environment that allows simultaneous debugging of both GPU and CPU code within the same application. Just as programming in CUDA C is an extension to C programming, debugging with CUDA-GDB is a natural extension to debugging with GDB. The existing GDB debugging features are inherently present for debugging the host code, and additional features have been provided to support debugging CUDA device code.

CUDA-GDB supports C and C++ CUDA applications. All the C++ features supported by the NVCC compiler can be debugged by CUDA-GDB.

CUDA-GDB allows the user to set breakpoints, to single-step CUDA applications, and also to inspect and modify the memory and variables of any given thread running on the hardware.

CUDA-GDB supports debugging all CUDA applications, whether they use the CUDA driver API, the CUDA runtime API, or both.

CUDA-GDB supports debugging kernels that have been compiled for specific CUDA architectures, such as sm_10 or sm_20, but also supports debugging kernels compiled at runtime, referred to as just-in-time compilation, or JIT compilation for short.

About This Document

This document is the main documentation for CUDA-GDB and is organized more as a user manual than a reference manual. The rest of the document will describe how to install and use CUDA-GDB to debug CUDA kernels, and how to ...
...take place simultaneously, as long as the CUDA devices are used exclusively. For instance, one instance of CUDA-GDB can debug a first application that uses the first GPU, while another instance of CUDA-GDB debugs a second application that uses the second GPU. The exclusive use of a GPU is achieved by specifying which GPU is visible to the application by using the CUDA_VISIBLE_DEVICES environment variable:

CUDA_VISIBLE_DEVICES=1 cuda-gdb my_app

CUDA/OpenGL Interop Applications on Linux

Any CUDA application that uses OpenGL interoperability requires an active windows server. Such applications will fail to run under console mode debugging on both Linux and Mac OS X. However, if the X server is running on Linux, the render GPU will not be enumerated when debugging, so the application could still fail, unless the application uses the OpenGL device enumeration to access the render GPU. But if the X session is running in non-interactive mode while using the debugger, the render GPU will be enumerated correctly.

Instructions

1. Launch your X session in non-interactive mode.
   a. Stop your X server.
   b. Edit /etc/X11/xorg.conf to contain the following line in the Device section corresponding to your display:
      Option "Interactive" "off"
   c. Restart your X server.
2. Log in remotely (SSH, etc.) and launch your application under CUDA-GDB. This setup works properly for single-GPU and multi-GPU configurations.
3. Ensure your DISPLAY environment variable is set appropriately.
...array[threadIdx.x] = ((0xf0f0f0f0 & array[threadIdx.x]) >> 4) | ...
(cuda-gdb) next
16          array[threadIdx.x] = ((0xcccccccc & array[threadIdx.x]) >> 2) | ...
(cuda-gdb) next
18          array[threadIdx.x] = ((0xaaaaaaaa & array[threadIdx.x]) >> 1) | ...
(cuda-gdb) next
Breakpoint 3, bitreverse<<<(1,1,1),(256,1,1)>>> (data=0x100000) at bitreverse.cu:21
21          idata[threadIdx.x] = array[threadIdx.x];

(cuda-gdb) print array[0]@12
$7 = {0, 128, 64, 192, 32, 160, 96, 224, 16, 144, 80, 208}
(cuda-gdb) print/x array[0]@12
$8 = {0x0, 0x80, 0x40, 0xc0, 0x20, 0xa0, 0x60, 0xe0, 0x10, 0x90, 0x50, 0xd0}
(cuda-gdb) print &data
$9 = (@global void * @parameter *) 0x10
(cuda-gdb) print *(@global void * @parameter *) 0x10
$10 = (@global void * @parameter) 0x100000

The resulting output depends on the current content of the memory location.

10. Since thread (0,0,0) reverses the value of 0, switch to a different thread to show more interesting data:

    (cuda-gdb) cuda thread 170
    [Switching focus to CUDA kernel 0, grid 1, block (0,0,0), thread (170,0,0), device 0, sm 0, warp 5, lane 10]

11. Delete the breakpoints and continue the program to completion:

    (cuda-gdb) delete breakpoints
    Delete all breakpoints? (y or n) y
    (cuda-gdb) continue
    Continuing.

    Program exited normally.
    (cuda-gdb)
