CUDA-MEMCHECK
=========     Host Frame:memcheck_demo (_Z20out_of_bounds_kernelv + 0x9) [0x1052]
=========     Host Frame:memcheck_demo (_Z17run_out_of_boundsv + 0x76) [0x...]
=========     Host Frame:memcheck_demo (main + 0x2d) [0xfd3]
=========     Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xfd) [0x1eb1d]
=========     Host Frame:memcheck_demo [0xd79]
=========
========= Program hit error 17 on CUDA API call to cudaFree
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/local/lib/libcuda.so [0x28850]
=========     Host Frame:/usr/local/lib/libcudart.so.5.0 (cudaFree + 0x20d) [0x364ed]
=========     Host Frame:memcheck_demo (main + 0x3e) [0xfe4]
=========     Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xfd) [0x1eb1d]
=========     Host Frame:memcheck_demo [0xd79]
=========
========= ERROR SUMMARY: 3 errors

6.1.4. Leak Checking in CUDA-MEMCHECK

To print information about the allocations that have not been freed at the time the CUDA context is destroyed, we can specify the --leak-check full option to CUDA-MEMCHECK.

www.nvidia.com | CUDA-MEMCHECK | DU-05355-001_v5.0

When running the program with the leak check option, the user is presented with a list of allocations that were not destroyed, along with the size of the allocation and the address of the allocation on the device. For allocations made on the host, each leak report will also print a backtrace corresponding to the saved host stack at the time the allocation was first made.
6.2. Integrated CUDA-MEMCHECK Example

This example shows how to enable CUDA-MEMCHECK from within CUDA-GDB, and how to detect errors within the debugger so you can access the line number information and check the state of the variables. In this example, the unaligned kernel has a misaligned memory access in block 1, lane 1, which gets trapped as an illegal lane address at line 6 from within CUDA-GDB. Note that CUDA-GDB displays the address and the address space that caused the bad access.

(cuda-gdb) set cuda memcheck on
(cuda-gdb) run
Starting program: memcheck_demo
[Thread debugging using libthread_db enabled]
Mallocing memory
[New Thread 0x7ffff6fe1710 (LWP 7783)]
[Context Create of context 0x6218a0 on Device 0]
[Launch of CUDA Kernel 0 (memset32_post<<<(1,1,1),(64,1,1)>>>) on Device 0]
Running unaligned_kernel
[Launch of CUDA Kernel 1 (unaligned_kernel<<<(1,1,1),(1,1,1)>>>) on Device 0]
Memcheck detected an illegal access to address (@global)0x400100001
Program received signal CUDA_EXCEPTION_1, Lane Illegal Address.
[Switching focus to CUDA kernel 1, grid 2, block (0,0,0), thread (0,0,0), device 0, sm 0, warp 0, lane 0]
0x0000000000078560 in unaligned_kernel<<<(1,
3.4. Understanding Memcheck Errors

The memcheck tool can produce a variety of different errors. This is a short guide showing some samples of errors and explaining how the information in each error report can be interpreted.

1. Memory access error: Memory access errors are generated for errors that the memcheck tool can correctly attribute, and for which it can identify the erroneous instruction. Below is an example of a precise memory access error:

Invalid __global__ write of size 4
=========     at 0x00000060 in memcheck_demo.cu:6:unaligned_kernel(void)
=========     by thread (0,0,0) in block (0,0,0)
=========     Address 0x400100001 is misaligned

Let us examine this error line by line.

Invalid __global__ write of size 4

The first line shows the memory segment, access type, and size being accessed. The memory segment is one of:
> __global__ : for device global memory
> __shared__ : for per-block shared memory
> __local__ : for per-thread local memory
In this case, the access was to device global memory. The next field contains information about the type of access, whether it was a read or a write. In this case, the access is a write. Finally, the last item is the size of the access in bytes. In this example, the access was 4 bytes in size.

at 0x00000060 in memcheck_demo.cu:6:unaligned_kernel(void)

The second line contains the PC of the instruction, the source file, and line number
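The report above comes from the unaligned_kernel in the memcheck_demo application shown in the Tool Examples chapter. As a minimal sketch of what triggers it: offsetting an int pointer by one byte yields a device global address that is not 4-byte aligned.

```cuda
#include <stdio.h>

__device__ int x;

// Writing a 4-byte int through a pointer offset by one byte produces a
// misaligned device global address, which memcheck reports as
// "Invalid __global__ write of size 4 ... is misaligned".
__global__ void unaligned_kernel(void) {
    *(int *)((char *)&x + 1) = 42;
}

int main(void) {
    unaligned_kernel<<<1, 1>>>();
    printf("Sync: %s\n", cudaGetErrorString(cudaDeviceSynchronize()));
    return 0;
}
```

Running this binary under cuda-memcheck produces the four-line report discussed above.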
MEMCHECK enabled.
> On Windows XP, the standalone CUDA-MEMCHECK tools will always run in blocking launch mode.
> When running CUDA-MEMCHECK tools in integrated mode with CUDA-GDB, only the memcheck tool is enabled. Also, the following features are disabled:
  > Nonblocking launches
  > Leak checking
  > API error checking
> CUDA-MEMCHECK tools do not support CUDA/Direct3D interop.
> The memcheck tool does not support CUDA API error checking for API calls made on the GPU using dynamic parallelism.
> The racecheck tool does not support dynamic parallelism or SM 3.5.

Notice

ALL NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS, DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY, "MATERIALS") ARE BEING PROVIDED "AS IS." NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE.

Information furnished is believed to be accurate and reliable. However, NVIDIA Corporation assumes no responsibility for the consequences of use of such information, or for any infringement of patents or other rights of third parties that may result from its use. No license is granted by implication or otherwise under any patent rights of NVIDIA Corporation. Specifications mentioned in this publication
Make sure to look at the device side backtrace to find the location in the application where the malloc()/free() call was made.

3.8. Leak Checking

The memcheck tool can detect leaks of allocated memory. Memory leaks are device side allocations that have not been freed by the time the context is destroyed. The memcheck tool tracks device memory allocations created using the CUDA driver or runtime APIs. Starting in CUDA 5, allocations that are created dynamically on the device heap by calling malloc() inside a kernel are also tracked.

For an accurate leak checking summary to be generated, the application's CUDA context must be destroyed at the end. This can be done explicitly by calling cuCtxDestroy() in applications using the CUDA driver API, or by calling cudaDeviceReset() in applications programmed against the CUDA run time API. The --leak-check full option must be specified to enable leak checking.

Chapter 4. RACECHECK TOOL

4.1. What is Racecheck?

The racecheck tool is a run time shared memory data access hazard detector. The primary use of this tool is to help identify memory access race conditions in CUDA applications that use shared memory.

In CUDA applications, storage declared with the __shared__ qualifier is placed in on chip shared memory. All threads in a thread block can access this per block shared memory. Shared memory goes out of scope when the thread block completes
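A minimal runtime-API sketch of an allocation that --leak-check full would report: the allocation is intentionally never freed, and cudaDeviceReset() destroys the context so the leak summary can be generated.

```cuda
#include <stdio.h>

int main(void) {
    int *devMem = NULL;

    // Leak: this allocation is never passed to cudaFree.
    cudaMalloc((void **)&devMem, 1024);

    // Destroy the CUDA context so the leak check summary is accurate;
    // without this call the leak cannot be attributed.
    cudaDeviceReset();
    return 0;
}
```

Running cuda-memcheck --leak-check full on this program reports the 1024-byte allocation along with its address on the device.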
TABLE OF CONTENTS

Chapter 1. Introduction
  1.1. About CUDA-MEMCHECK
  1.2. Why CUDA-MEMCHECK?
  1.3. New Features in 5.0
  1.4. How to Get CUDA-MEMCHECK
  1.5. CUDA-MEMCHECK tools
Chapter 2. Using CUDA-MEMCHECK
  2.1. Command Line Options
  2.2. Supported Operating Systems
  2.3. Supported Devices
  2.4. Compilation Options
Chapter 3. Memcheck Tool
  3.1. What is Memcheck?
  3.2. Supported Error Detection
CUDA_EXCEPTION_8 : Warp Invalid PC | Not precise | Warp error | This occurs when any thread within a warp advances its PC beyond the 40-bit address space.
CUDA_EXCEPTION_9 : Warp Hardware Stack Overflow | Not precise | Warp error | This occurs when any thread in a warp triggers a hardware stack overflow. This should be a rare occurrence.
CUDA_EXCEPTION_10 : Device Illegal Address | Not precise | Global error | This occurs when a thread accesses an illegal (out of bounds) global address.
CUDA_EXCEPTION_11 : Lane Misaligned Address | Precise | Per lane (thread) error | This occurs when a thread accesses a global address that is not correctly aligned.
CUDA_EXCEPTION_12 : Warp Assert | Precise | Per warp | This occurs when any thread in the warp hits a device side assertion.
CUDA_EXCEPTION_13 : Lane Syscall Error | Precise | Per lane | This occurs when a particular thread causes a syscall error, such as calling free() in a kernel on an already free'd pointer.
Unknown Exception | Not precise | Global error | The precise cause of the exception is unknown. Potentially, this may be due to Device Hardware Stack overflows or a kernel generating an exception very close to its termination.

Appendix C. KNOWN ISSUES

The following are known issues with the current release.
> Applications run much slower under CUDA-MEMCHECK tools. This may cause some kernel launches to fail with a launch timeout error when running with CUDA
Shared memory hazard | Device | Continue application | Error reported. No other action taken

Chapter 6. CUDA-MEMCHECK TOOL EXAMPLES

6.1. Example Use of Memcheck

This section presents a walk-through of running the memcheck tool from CUDA-MEMCHECK on a simple application called memcheck_demo. Depending on the SM type of your GPU, your system output may vary.

memcheck_demo.cu source code:

#include <stdio.h>

__device__ int x;

__global__ void unaligned_kernel(void) {
    *(int*) ((char*)&x + 1) = 42;
}

__device__ void out_of_bounds_function(void) {
    *(int*) 0x87654320 = 42;
}

__global__ void out_of_bounds_kernel(void) {
    out_of_bounds_function();
}

void run_unaligned(void) {
    printf("Running unaligned_kernel\n");
    unaligned_kernel<<<1,1>>>();
    printf("Ran unaligned_kernel: %s\n",
        cudaGetErrorString(cudaGetLastError()));
    printf("Sync: %s\n", cudaGetErrorString(cudaThreadSynchronize()));
}

void run_out_of_bounds(void) {
    printf("Running out_of_bounds_kernel\n");
    out_of_bounds_kernel<<<1,1>>>();
    printf("Ran out_of_bounds_kernel: %s\n",
        cudaGetErrorString(cudaGetLastError()));
    printf("Sync: %s\n", cudaGetErrorString(cudaThreadSynchronize()));
}

int main() {
    int *devMem;
    printf("Mallocing memory\n");
    cudaMalloc((void**)&devMem, 1024);
> Precise error detection for local loads/stores, shared loads/stores, and global atomics/reductions. On SM 3.5, added precise memory access error detection for noncoherent global loads through the texture unit. For more information, see Memory Access Error Reporting.
> Error detection in device side malloc()/free(), such as double free or invalid free on the GPU. For more information, see Device Side Allocation Checking.
> Leak checking for allocations on the device heap. For more information, see Leak Checking.
> Display of a saved stack backtrace on the host, and captured backtrace on the device, for different errors. For more information, see Stack Backtraces.
> Reporting of CUDA API errors in the user's application. For more information, see CUDA API Error Checking.
> Added display of mangled, demangled, and full prototype of the kernel. For more information, see Name Demangling.
> Increased functionality in integrated mode with CUDA-GDB:
  > Added reporting of the address and address space being accessed that caused a precise exception.
  > Added checking of device side malloc() and free() when in integrated mode.
  > For more information, see Integrated Mode.
> Support for applications compiled separately that use the device side linker.
> Support for applications compiled with the -lineinfo flag.
> New style of command line options. For more information, see Command Line Options.
> Changed default behavior: CUDA-MEMCHECK will display backtraces by default and will report
detailing information about hazards in the application. The tool is byte accurate, and so produces a message for each byte on which a hazard was detected. Additionally, when enabled, the host backtrace for the launch of the kernel will also be displayed.

A sample racecheck hazard is below:

ERROR: Potential WAW hazard detected at __shared__ 0x0 in block (0,0,0):
=========     Write Thread (0,0,0) at 0x00000088 in raceWAW.cu:18:WAW(void)
=========     Write Thread (1,0,0) at 0x00000088 in raceWAW.cu:18:WAW(void)
=========     Current Value : 0, Incoming Value : 2

The hazard records are dense and capture a lot of interesting information. In general terms, the first line contains information about the hazard: its severity, type, and address, as well as information about the thread block where it occurred. The next two lines contain detailed information about the two threads that were in contention. These two lines are ordered chronologically, so the first entry is for the access that occurred earlier, and the second for the access that occurred later. The final line is printed for some hazard types and captures the actual data that was being written.

Examining this line by line, we have:

ERROR: Potential WAW hazard detected at __shared__ 0x0 in block (0,0,0)

The first word on this line indicates the severity of this hazard. This can be:
> INFO : The lowest level of severity.
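The raceWAW.cu source is not reproduced in this document; the following is a hypothetical sketch of a kernel that would produce a report like the one above.

```cuda
#include <stdio.h>

// Both threads of the block store to the same shared-memory word with no
// synchronization between the stores: a Write-After-Write (WAW) hazard.
// With thread 0 writing 0 and thread 1 writing 2, racecheck can report
// "Current Value : 0, Incoming Value : 2".
__global__ void WAW(void) {
    __shared__ int smem;
    smem = threadIdx.x * 2;
    __syncthreads();
    if (threadIdx.x == 0)
        printf("Final value: %d\n", smem);
}

int main(void) {
    WAW<<<1, 2>>>();
    cudaDeviceSynchronize();
    return 0;
}
```

Run under cuda-memcheck --tool racecheck, each byte of the contended word is reported as a potential WAW hazard.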
leak-check | full, no | no | Prints information about all allocations that have not been freed via cudaFree at the point when the context was destroyed. For more information, see Leak Checking.
report-api-errors | yes, no | yes | Report errors if any CUDA API call fails. For more information, see CUDA API Error Checking.

2.2. Supported Operating Systems

The standalone CUDA-MEMCHECK binary is supported on all CUDA supported platforms, i.e. Windows, Mac OS X, and supported Linux distributions. CUDA-MEMCHECK can interoperate with CUDA-GDB on Linux and Mac OS X.

2.3. Supported Devices

The CUDA-MEMCHECK tool suite is supported on all CUDA capable GPUs with SM versions 1.1 and above. Individual tools may support a different subset of GPUs. The table below contains the list of current tools and their supported GPUs.

Table 4. Supported Devices by CUDA-MEMCHECK tool

2.4. Compilation Options

The CUDA-MEMCHECK tools memcheck and racecheck do not need any special compilation flags to function. The output displayed by the CUDA-MEMCHECK tools is more useful with some extra compiler flags. The -G option to nvcc forces the compiler to generate debug information for the CUDA application. To generate line number information for applications without affecting the optimization level of the output, the -lineinfo option to nvcc can be used. The CUDA-MEMCHECK tools fully support both of these options and can display source attribution of errors
Table 2. CUDA-MEMCHECK Command line Options
Table 3. Memcheck Tool Command line Options
Table 4. Supported Devices by CUDA-MEMCHECK tool
Table 5. Memcheck reported error types
Table 6. CUDA-MEMCHECK Stack Backtrace Information
Table 7. CUDA-MEMCHECK Error Actions
Table 8. Memcheck memory access error detection support
Table 9. CUDA Exception Codes

Chapter 1. INTRODUCTION

1.1. About CUDA-MEMCHECK

CUDA-MEMCHECK is a functional correctness checking suite included in the CUDA toolkit. This suite contains multiple tools that can perform different types of checks. The memcheck tool is capable of precisely detecting and attributing out of bounds and misaligned memory access errors in CUDA applications. The tool also reports hardware exceptions encountered by the GPU. The racecheck tool can report shared memory data access hazards that can cause data races. This document describes the usage of these tools.
CUDA-MEMCHECK can be run in standalone mode, where the user's application is started under CUDA-MEMCHECK. The memcheck tool can also be enabled in integrated mode inside CUDA-GDB.

1.2. Why CUDA-MEMCHECK?

NVIDIA allows developers to easily harness the power of GPUs to solve problems in parallel using CUDA. CUDA applications often run thousands of threads in parallel. Every programmer invariably encounters memory access errors and thread ordering errors that are hard to detect and time consuming to debug. The number of such errors increases substantially when dealing with thousands of threads. The CUDA-MEMCHECK suite is designed to detect such errors in your CUDA application.

Using the memcheck tool, CUDA-MEMCHECK can identify memory access errors as well as hardware reported program errors. The racecheck tool in CUDA-MEMCHECK can identify hazards caused by race conditions in the CUDA program.

1.3. New Features in 5.0

> Reporting of data access hazards in shared memory accesses. This is supported on Fermi (SM 2.x) and Kepler (SM 3.0) GPUs. This functionality is not supported on Windows XP. For more information, see Racecheck Tool.
> Support for SM 3.0 and SM 3.5 GPUs. For more information, see Supported Devices.
> Support for dynamic parallelism. All memory access error detection is supported for applications using dynamic parallelism. For more information, see Dynamic Parallelism.
=========     Host Frame:memcheck_demo (__device_stub__Z16unaligned_kernelv + 0x1d) [0x101d]
=========     Host Frame:memcheck_demo (_Z16unaligned_kernelv + 0x9) [0x1028]
=========     Host Frame:memcheck_demo (_Z13run_unalignedv + 0x76) [0xeaa]
=========     Host Frame:memcheck_demo (main + 0x28) [0xfce]
=========     Host Frame:/lib64/libc.so.6 (__libc_start_main)
=========     Host Frame:memcheck_demo [0xd79]
=========
========= Invalid __global__ write of size 4
=========     at 0x00000028 in memcheck_demo.cu:10:out_of_bounds_function(void)
=========     by thread (0,0,0) in block (0,0,0)
=========     Address 0x87654320 is out of bounds
=========     Device Frame:memcheck_demo.cu:15:out_of_bounds_kernel(void) (out_of_bounds_kernel(void) : 0x10)
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/local/lib/libcuda.so (cuLaunchKernel + 0x3ae) [0xddbee]
=========     Host Frame:/usr/local/lib/libcudart.so.5.0 [0xcd27]
=========     Host Frame:/usr/local/lib/libcudart.so.5.0 (cudaLaunch + 0x1bb) [0x3778b]
=========     Host Frame:memcheck_demo (_Z10cudaLaunchIcE9cudaErrorPT_ + 0x18) [0x11a4]
=========     Host Frame:memcheck_demo (__device_stub__Z20out_of_bounds_kernelvv + 0x1d) [0x1047]
  3.3. Using Memcheck
  3.4. Understanding Memcheck Errors
  3.5. Integrated Mode
  3.6. CUDA API Error Checking
  3.7. Device Side Allocation Checking
  3.8. Leak Checking
Chapter 4. Racecheck Tool
  4.1. What is Racecheck?
  4.2. What are Hazards?
  4.3. Using Racecheck
  4.4. Understanding Racecheck Reports
Chapter 5. CUDA-MEMCHECK Features
  5.1. Nonblocking Mode
  5.2. Stack Backtraces
  5.3. Name Demangling
are subject to change without notice. This publication supersedes and replaces all other information previously supplied. NVIDIA Corporation products are not authorized as critical components in life support devices or systems without express written approval of NVIDIA Corporation.

Trademarks

NVIDIA and the NVIDIA logo are trademarks or registered trademarks of NVIDIA Corporation in the U.S. and other countries. Other company and product names may be trademarks of the respective companies with which they are associated.

Copyright

© 2007-2012 NVIDIA Corporation. All rights reserved.

www.nvidia.com
tool is byte accurate, the message is only for the byte of memory at the given address. In this example, the byte being accessed is byte 0x0 in shared memory. Finally, the first line contains the block index of the thread block to which the two racing threads belong.

The second line contains information about the first thread to write to this location.

Write Thread (0,0,0) at 0x00000088 in raceWAW.cu:18:WAW(void)

The first item on this line indicates the type of access being performed by this thread to the shared memory address. In this example, the thread was writing to the location. The next component is the index of the thread in the thread block. In this case, the thread is at index (0,0,0). Following this, we have the byte offset of the instruction which did the access in the kernel. In this example, the offset is 0x88. This is followed by the source file and line number (if line number information is available). The final item on this line is the name of the kernel that was being executed.

The third line contains similar information about the second thread, which was causing this hazard. This line has an identical format to the previous line.

The fourth line contains information about the data in the two accesses.

Current Value : 0, Incoming Value : 2

If the second thread in the hazard was performing a write access, i.e. the hazard is a Write After Write (WAW)
=========     Host Frame:memcheck_demo (_Z16unaligned_kernelv + 0x9) [0x10b1]
=========     Host Frame:memcheck_demo (_Z13run_unalignedv + 0x76) [0xf33]
=========     Host Frame:memcheck_demo (main + 0x28) [0x1057]
=========     Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xfd) [0x1eb1d]
=========     Host Frame:memcheck_demo [0xde9]
=========
========= Invalid __global__ write of size 4
=========     at 0x00000028 in memcheck_demo.cu:10:out_of_bounds_function(void)
=========     by thread (0,0,0) in block (0,0,0)
=========     Address 0x87654320 is out of bounds
=========     Device Frame:memcheck_demo.cu:15:out_of_bounds_kernel(void) (out_of_bounds_kernel(void) : 0x10)
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/local/lib/libcuda.so (cuLaunchKernel + 0x3ae) [0xddbee]
=========     Host Frame:/usr/local/lib/libcudart.so.5.0 [0xcd27]
=========     Host Frame:/usr/local/lib/libcudart.so.5.0 (cudaLaunch + 0x1bb) [0x3778b]
=========     Host Frame:memcheck_demo (_Z10cudaLaunchIcE9cudaErrorPT_ + 0x18) [0x122c]
=========     Host Frame:memcheck_demo (__device_stub__Z20out_of_bounds_kernelvv + 0x1d) [0x10d0]
=========     Host Frame:memcheck_demo (_Z20out_of_bounds_kernelv + 0x9) [0x10db]
=========     Host Frame:memcheck_demo (_Z17run_out_of_boundsv + 0x76)
This is for hazards that have no impact on program execution, and hence are not contributing to data access hazards. It is still a good idea to find and eliminate such hazards.
> WARNING : Hazards at this level of severity are determined to be programming model hazards; however, they may be intentionally created by the programmer. An example of this is hazards due to warp level programming that makes the assumption that threads are proceeding in groups. Such hazards are typically only encountered by advanced programmers. In cases where a beginner programmer encounters such errors, they should be treated as sources of hazards.
> ERROR : The highest level of severity. This corresponds to hazards that are very likely candidates for causing data access races. Programmers would be well advised to examine errors at this level of severity.

In this case, the message is at the ERROR level of severity. The next piece of information here is the type of hazard. The racecheck tool detects three types of hazards:
> WAW, or Write After Write, hazards
> WAR, or Write After Read, hazards
> RAW, or Read After Write, hazards
The type of hazard indicates the access types of the two threads that were in contention. In this example, the hazard is of Write After Write type.

The next piece of information is the address in shared memory that was being accessed. This is the offset in per block shared memory that was being accessed by both threads. Since the racecheck
execution. As shared memory is on chip, it is frequently used for inter thread communication, and as a temporary buffer to hold data being processed. As this data is being accessed by multiple threads in parallel, incorrect program assumptions may result in data races. Racecheck is a tool built to identify these hazards and help users write programs free of shared memory races.

Currently, this tool only supports detecting accesses to on chip shared memory. For supported architectures, see Supported Devices.

4.2. What are Hazards?

A data access hazard is a case where two threads attempt to access the same location in memory, resulting in nondeterministic behavior based on the relative order of the two accesses. These hazards cause data races, where the behavior or the output of the application depends on the order in which all parallel threads are executed by the hardware. Race conditions manifest as intermittent application failures, or as failures when attempting to run a working application on a different GPU.

The racecheck tool identifies three types of canonical hazards in a program. These are:
> Write After Write (WAW) hazards
This hazard occurs when two threads attempt to write data to the same memory location. The resulting value in that location depends on the relative order of the two accesses.
> Read After Write (RAW) hazards
This hazard occurs when two
By default, the application will run so that the kernel is terminated on memory access errors, but other work in the CUDA context can still proceed.

In the output below, the first kernel no longer reports an unspecified launch failure, as its execution has been terminated early after CUDA-MEMCHECK detected the error. The application continued to run the second kernel. The error detected in the second kernel causes it to terminate early. Finally, the application calls cudaDeviceReset(), which destroys the CUDA context, and then attempts to call cudaFree(). This call returns an API error that is caught and displayed by memcheck.

cuda-memcheck memcheck_demo
========= CUDA-MEMCHECK
Mallocing memory
Running unaligned_kernel
Ran unaligned_kernel: no error
Sync: no error
Running out_of_bounds_kernel
Ran out_of_bounds_kernel: no error
Sync: no error
========= Invalid __global__ write of size 4
=========     at 0x00000028 in memcheck_demo.cu:6:unaligned_kernel(void)
=========     by thread (0,0,0) in block (0,0,0)
=========     Address 0x400100001 is misaligned
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/local/lib/libcuda.so (cuLaunchKernel + 0x3ae) [0xddbee]
=========     Host Frame:/usr/local/lib/libcudart.so.5.0 [0xcd27]
=========     Host Frame:/usr/local/lib/libcudart.so.5.0 (cudaLaunch + 0x1bb) [0x3778b]
=========     Host Frame:memcheck_demo (_Z10cudaLaunchIcE9cudaErrorPT_ + 0x18)
for applications compiled with line information.

The stack backtrace feature of the CUDA-MEMCHECK tools is more useful when the application contains function symbol names. For the host backtrace, this varies based on the host OS. On Linux, the host compiler must be given the -rdynamic option to retain function symbols. On Windows, the application must be compiled for debugging, i.e. with the /Zi option. For the device backtrace, the full frame information is only available when the application is compiled with device debug information. The compiler can skip generation of frame information when building with optimizations.

Sample command line to build with function symbols and device side line information on Linux:

nvcc -Xcompiler -rdynamic -lineinfo -o out in.cu

Chapter 3. MEMCHECK TOOL

3.1. What is Memcheck?

The memcheck tool is a run time error detection tool for CUDA applications. The tool can precisely detect and report out of bounds and misaligned memory accesses to global, local, and shared memory, and in global atomic instructions in CUDA applications. It can also detect and report hardware reported error information. In addition, the memcheck tool can detect and report memory leaks in the user application.

3.2. Supported Error Detection

The errors that can be reported by the memcheck tool are summarized in the table below.
heap that were not freed when the CUDA context was destroyed. An example of a cudaMalloc() allocation leak report follows:

Leaked 64 bytes at 0x400200200

The error message reports information about the size of the allocation that was leaked, as well as the address of the allocation on the device. A device heap leak message will be explicitly identified as such:

Leaked 16 bytes at 0x4012ffff6 on the device heap

5. CUDA API error: CUDA API errors are reported for CUDA API calls that return an error value. An example of a CUDA API error:

Program hit error 11 on CUDA API call to cudaMemset

The message contains the returned value of the CUDA API call, as well as the name of the API function that was called.

3.5. Integrated Mode

You can execute the memcheck tool from within CUDA-GDB by using the following option before running the application:

(cuda-gdb) set cuda memcheck on

In integrated mode, the memcheck tool improves the precision of error reporting by CUDA-GDB. The memory access checks are enabled, allowing identification of the thread that may be causing a warp or device level exception.

3.6. CUDA API Error Checking

The memcheck tool supports reporting an error if a CUDA API call made by the user program returned an error. The tool supports this detection for both CUDA run time and CUDA driver API calls. In all cases
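Since API error reports are informational only, the application itself must still handle failures. A hedged sketch of the host-side checking pattern (the invalid pointer value is arbitrary, chosen only to make cudaMemset fail):

```cuda
#include <stdio.h>

int main(void) {
    // Deliberately pass a bogus device pointer so the API call fails.
    cudaError_t err = cudaMemset((void *)0x4, 0, 16);
    if (err != cudaSuccess) {
        // The application must check return codes itself; the memcheck
        // tool's API error report does not terminate the program.
        fprintf(stderr, "cudaMemset failed: %s\n", cudaGetErrorString(err));
    }
    return 0;
}
```

Run under CUDA-MEMCHECK with API error checking enabled, this program also produces a "Program hit error ... on CUDA API call to cudaMemset" report.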
1,1),(1,1,1)>>> () at memcheck_demo.cu:6
6           *(int*) ((char*)&x + 1) = 42;
(cuda-gdb) print &x
$1 = (@global int *) 0x400100000
(cuda-gdb) continue
Continuing.
[Termination of CUDA Kernel 1 (unaligned_kernel<<<(1,1,1),(1,1,1)>>>) on Device 0]
[Termination of CUDA Kernel 0 (memset32_post<<<(1,1,1),(64,1,1)>>>) on Device 0]
Program terminated with signal CUDA_EXCEPTION_1, Lane Illegal Address.
The program no longer exists.
(cuda-gdb)

Appendix A. MEMORY ACCESS ERROR REPORTING

The memcheck tool will report memory access errors when run standalone or in integrated mode with CUDA-GDB. The table below describes the types of accesses that are checked and the SM version where such checks happen.

Table 8. Memcheck memory access error detection support

Appendix B. HARDWARE EXCEPTION REPORTING

The CUDA-MEMCHECK tool will report hardware exceptions when run as a standalone tool or as part of CUDA-GDB. The table below enumerates the supported exceptions, their precision and scope, as well as a brief description of their cause. For more detailed information, see the documentation for CUDA-GDB.

Table 9. CUDA Exception Codes

Exception code | Precision | Scope of the error | Description of the error
CUDA_EXCEPTION_1 : Lane Illegal Address | Precise | Per lane
(if available), and the CUDA kernel name. In this example, the instruction causing the access was at PC 0x60, inside the unaligned_kernel CUDA kernel. Additionally, since the application was compiled with line number information, this instruction corresponds to line 6 in the memcheck_demo.cu source file.

by thread (0,0,0) in block (0,0,0)

The third line contains the thread indices and block indices of the thread on which the error was hit. In this example, the thread doing the erroneous access belonged to the first thread in the first block.

Address 0x400100001 is misaligned

The fourth line contains the memory address being accessed and the type of access error. The type of access error can either be an out of bounds access or a misaligned access. In this example, the access was to address 0x400100001, and the access error occurred because this address was not aligned correctly.

2. Hardware exception: Imprecise errors are generated for errors that the hardware reports to the memcheck tool. Hardware exceptions have a variety of formats and messages. Typically, the first line will provide some information about the type of error encountered.

3. Malloc/free error: Malloc/free errors refer to errors in the invocation of device side malloc()/free() in CUDA kernels. An example of a malloc/free error:

Malloc/Free error encountered : Double free
if the API function call has a nonzero return value, CUDA-MEMCHECK will print an error message containing the name of the API call that failed and the return value of the API call. CUDA API error reports do not terminate the application; they merely provide extra information. It is up to the application to check the return status of CUDA API calls and handle error conditions appropriately.

3.7. Device Side Allocation Checking

On SM 2.0 and higher GPUs, the memcheck tool checks accesses to allocations in the device heap. These allocations are created by calling malloc() inside a kernel. This feature is implicitly enabled and can be disabled by specifying the --check-device-heap no option. This feature is only activated for kernels in the application that call malloc().

The current implementation does not require space on the device heap, so the heap allocation behavior of the program with and without memcheck should remain similar. The memcheck tool does require space in device global memory to track these heap allocations, and will print an internal error message if it is not able to allocate this space in device global memory.

In addition to access checks, the memcheck tool can now perform libc style checks on the malloc()/free() calls. The tool will report an error if the application calls free() twice on the same allocation in a kernel, or if it calls free() on an invalid pointer.
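As a sketch of what device side allocation checking covers, the following hypothetical kernel uses malloc()/free() on the device heap correctly; under the memcheck tool, an out of bounds access to buf or a second free(buf) would be reported. The heap sizing call and kernel shape are illustrative, not taken from this manual:

```cuda
#include <cstdio>

// Each thread allocates from the device heap, uses the memory, and frees it
// exactly once, so the memcheck tool reports nothing for this kernel.
__global__ void heap_user(void) {
    int *buf = (int *) malloc(4 * sizeof(int));
    if (buf == NULL) return;   // device malloc can fail; the heap is finite
    buf[0] = threadIdx.x;      // in-bounds access to the heap allocation
    free(buf);                 // exactly one free per allocation
}

int main(void) {
    // Size the device heap before launching kernels that call malloc().
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 1 << 20);
    heap_user<<<1, 32>>>();
    cudaDeviceSynchronize();
    return 0;
}
```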
lobal atomic instructions, for applications using dynamic parallelism. In addition, the imprecise hardware exception reporting mechanism is also fully supported. Error detection on applications using dynamic parallelism requires significantly more memory on the device; as a result, in memory constrained environments, memcheck may fail to initialize with an internal out of memory error. For limitations, see Known Issues.

5.5. Error Actions

On encountering an error, CUDA-MEMCHECK behavior depends on the type of error. The default behavior of CUDA-MEMCHECK is to continue execution on purely host side errors. Hardware exceptions detected by the memcheck tool cause the CUDA context to be destroyed. Precise errors, such as memory access and malloc/free errors detected by the memcheck tool, cause the kernel to be terminated. This terminates the kernel without running any subsequent instructions, and the application continues launching other kernels in the CUDA context. The handling of memory access and malloc/free errors detected by the memcheck tool can be changed using the --destroy-on-device-error option. For racecheck detected hazards, the hazard is reported, but execution is not affected. For a full summary of error action based on the type of the error, see the table below.

1 In some cases, there may be no device backtrace.

The error action terminate kernel refers to the cases where the kernel is terminated early and no subsequent instructions are run. In such cases, the CUDA context is not destroyed; other kernels continue execution and CUDA API calls can still be made. When kernel execution is terminated early, the application may not have completed its computations on data; any subsequent kernels that depend on this data will have undefined behavior. The action terminate CUDA context refers to the cases where the CUDA context is forcibly terminated. In such cases, all outstanding work for the context is terminated and subsequent CUDA API calls will fail. The action continue application refers to cases where the application execution is not impacted and the kernel continues executing instructions.

Table 7. CUDA-MEMCHECK Error Actions

Error type | Location | Action | Comments
Memory access error | Device | Terminate kernel | User can choose to instead terminate the CUDA context
Hardware exception | Device | Terminate CUDA context | Subsequent calls on the CUDA context will fail
Malloc/Free error | Device | Terminate kernel | User can choose to instead terminate the CUDA context
cudaMalloc allocation leak | Host | Continue application | Error reported. No other action taken.
CUDA API error | Host | Continue application | Error reported. No other action taken.
CUDA-MEMCHECK internal error | Host | Undefined | The application may behave in an undefined fashion
Device heap allocation leak | Device | Continue application | Error reported. No other action taken.
low. The location column indicates whether the report originates from the host or from the device. The precision of an error is explained in the paragraph below.

Table 5. Memcheck reported error types

Name | Description | Location | Precision | See also
Memory access error | Errors due to out of bounds or misaligned accesses to memory by a global, local, shared or global atomic access | Device | Precise | Memory Access Error Reporting
Hardware exception | Errors that are reported by the hardware error reporting mechanism | Device | Imprecise | Hardware Exception Reporting
Malloc/Free errors | Errors that occur due to incorrect use of malloc()/free() in CUDA kernels | Device | Precise | Device Side Allocation Checking
CUDA API errors | Reported when a CUDA API call in the application returns a failure | Host | Precise | CUDA API Error Checking
cudaMalloc memory leaks | Allocations of device memory using cudaMalloc() that have not been freed by the application | Host | Precise | Leak Checking
Device Heap Memory Leaks | Allocations of device memory using malloc() in device code that have not been freed by the application | Device | Imprecise | Device Side Allocation Checking

The memcheck tool reports two classes of errors: precise and imprecise.

Precise errors in memcheck are those that the tool can uniquely identify and gather all information for. For these errors, memcheck can report the block and thread coordinates of the thread causing the failure, the PC of the instruction performing the access, as well as the address being accessed and its size and type. If the CUDA application contains line number information (by either being compiled with device side debugging information, or with line information), then the tool will also print the source file and line number of the erroneous access.

Imprecise errors are errors reported by the hardware error reporting mechanism that could not be precisely attributed to a particular thread. The precision of the error varies based on the type of the error, and in many cases memcheck may not be able to attribute the cause of the error back to the source file and line. Imprecise error reporting is only supported on SM 2.0 and higher GPUs.

3.3. Using Memcheck

The memcheck tool is enabled by default when running the CUDA-MEMCHECK application. It can also be explicitly enabled by using the --tool memcheck option:

cuda-memcheck [memcheck options] app_name [app options]

When run in this way, the memcheck tool will look for precise, imprecise, malloc/free, and CUDA API errors. The reporting of device leaks must be explicitly enabled. Errors identified by the memcheck tool are displayed on the screen after the application has completed execution. See Understanding Memcheck Errors for more information about how to interpret the messages printed by the tool.
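To make the workflow concrete, here is a minimal hypothetical program (not the manual's memcheck_demo source) whose kernel performs an out of bounds global write; running it under cuda-memcheck would produce a precise "Invalid __global__ write" report:

```cuda
#include <cstdio>

// Writing through an arbitrary integer address falls outside any
// allocation; the memcheck tool flags this as a precise out of bounds
// global write, with thread/block coordinates and the offending address.
__global__ void out_of_bounds_kernel(void) {
    *(int *) 0x87654320 = 42;   // address outside any device allocation
}

int main(void) {
    out_of_bounds_kernel<<<1, 1>>>();
    cudaError_t err = cudaDeviceSynchronize();
    printf("Sync: %s\n", cudaGetErrorString(err));
    return 0;
}
```

Invoked as `cuda-memcheck ./app`, the error report appears after the application finishes.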
o threads access the same memory location, with one thread performing a read and another a write. In this case, the writing thread is ordered before the reading thread, and the value returned to the reading thread is not the original value at the memory location.

‣ Write After Read (WAR) hazards: This hazard occurs when two threads access the same memory location, with one thread performing a read and the other a write. In this case, the reading thread reads the value before the writing thread commits it.

4.3. Using Racecheck

The racecheck tool is enabled by running the CUDA-MEMCHECK application with the --tool racecheck option:

cuda-memcheck --tool racecheck [memcheck options] app_name [app options]

Once racecheck has identified a hazard, the user can make program modifications to ensure the hazard is no longer present. In the case of Write After Write hazards, the program should be modified so that multiple writes do not happen to the same location. In the case of Read After Write and Write After Read hazards, the reading and writing locations should be deterministically ordered; in CUDA kernels, this can be achieved by inserting a __syncthreads() call between the two accesses. The racecheck tool does not perform any memory access error checking, so it is recommended that users first run the memcheck tool to ensure the application is free of errors.

4.4. Understanding Racecheck Reports

The racecheck tool produces a series of messages
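The __syncthreads() fix described above can be sketched as follows. The kernel is hypothetical: one thread writes a shared memory location, every thread then reads it, and the barrier deterministically orders the write before the reads. Removing the barrier reintroduces a Read After Write hazard of the kind racecheck reports:

```cuda
// One writer, many readers of the same shared memory location.
__global__ void raw_hazard_fixed(int *out) {
    __shared__ int smem;
    if (threadIdx.x == 0)
        smem = 42;              // write by a single thread
    __syncthreads();            // orders the write before all reads;
                                // without it, racecheck reports a RAW hazard
    out[threadIdx.x] = smem;    // read by every thread in the block
}
```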
oint at kernel launch time
=========     Host Frame:/usr/local/lib/libcuda.so (cuLaunchKernel + 0x3ae) [0xddbee]
=========     Host Frame:/usr/local/lib/libcudart.so.5.0 [0xcd27]
=========     Host Frame:/usr/local/lib/libcudart.so.5.0 (cudaLaunch + 0x1bb) [0x3778b]
=========     Host Frame:memcheck_demo [0xdfc]
=========     Host Frame:memcheck_demo [0xca0]
=========     Host Frame:memcheck_demo [0xcab]
=========     Host Frame:memcheck_demo [0xbbc]
=========     Host Frame:memcheck_demo [0xc2c]
=========     Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xfd) [0x1eb1d]
=========     Host Frame:memcheck_demo [0x9b9]
=========
========= Program hit error 17 on CUDA API call to cudaFree
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/local/lib/libcuda.so [0x28 850]
=========     Host Frame:/usr/local/lib/libcudart.so.5.0 (cudaFree + 0x20d) [0x364ed]
=========     Host Frame:memcheck_demo [0xc3d]
=========     Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xfd) [0x1eb1d]
=========     Host Frame:memcheck_demo [0x9b9]
=========
========= ERROR SUMMARY: 3 errors

6.1.3. memcheck_demo Output with Memcheck (Debug Build)

The application is now built with device side debug information and function symbols as:

nvcc -G -Xcompiler -rdynamic -gencode arch=compute_20,code=sm_20 -o memcheck_demo memcheck_demo.cu

Now run this application with CUDA-MEMCHECK and check the output. By de
ons have a one character short form, which is given in parentheses. These options can be invoked using a single hyphen; for example, the help option can be invoked as -h. The options that have a short form do not take a value.

The second column contains the permissible values for the option. In case the value is user defined, this is shown below in braces. An option that can accept any numerical value is represented as {number}. Blank entries indicate that no value is present.

The third column contains the default value of the option. Some options have different default values depending on the architecture they are being run on.

Table 2. CUDA-MEMCHECK Command line options

Option | Values | Default | Description
--demangle | full, simple, no | full | Enables demangling of device function names. For more information, see Name Demangling.
--destroy-on-device-error | context, kernel | context | This controls how the application proceeds on hitting a memory access error. For more information, see Error Actions.
--error-exitcode | {number} | | The exit code memcheck will return if the original application succeeded but there were any memcheck detected errors. This is meant to allow CUDA-MEMCHECK to be integrated into automated test suites.
--flush-to-disk | yes, no | no | Forces every disk write to be flushed to disk. When enabled, this will make CUDA-MEMCHECK tools much slower.
--force-blocking-launches | yes, no | no on SM 2.0 and higher, yes otherwise | This f
or a Write After Read (WAR) hazard, this line contains the value after the access by the first thread as the Current Value, and the value that will be written by the second access as the Incoming Value. In this case, the first thread wrote the value 0 to the shared memory location; the second thread is attempting to write the value 2.

Chapter 5. CUDA-MEMCHECK FEATURES

5.1. Nonblocking Mode

By default, on SM 2.0 and higher GPUs the standalone CUDA-MEMCHECK tool will launch kernels in nonblocking mode. This allows the tool to support error reporting in applications running concurrent kernels. To force kernels to execute serially, a user can use the --force-blocking-launches yes option. Blocking launch mode is always enabled on Mac OS X 10.6 and on Windows XP. This flag has no effect on GPUs less than SM 2.0. One side effect of blocking mode is that only the first thread to hit an error in a kernel will be reported.

5.2. Stack Backtraces

In standalone mode, CUDA-MEMCHECK can generate backtraces when given the --show-backtrace option. Backtraces usually consist of two sections: a saved host backtrace that leads up to the CUDA driver call site, and a device backtrace at the time of the error. Each backtrace contains a list of function calls showing the state of the stack at the time the backtrace was created. To get function names in the host backtraces, the user application must be built with sup
orces all host kernel launches to be sequential. When enabled, the number and precision of memcheck reported errors will decrease. This option only has effect on SM 2.0 and higher GPUs.
--prefix | {string} | | The string prepended to CUDA-MEMCHECK output lines.
--print-level | info, warn, error, fatal | info | The minimum print level of messages from CUDA-MEMCHECK.
--read | {filename} | N/A | The input memcheck file to read data from. This can be used in conjunction with the --save option to allow processing records after a run.
--save | {filename} | N/A | Filename where CUDA-MEMCHECK will save the output from the current run.
--show-backtrace | yes, host, device, no | yes | Displays a backtrace for most types of errors. No disables all backtraces. Yes enables all backtraces. Host enables only host side backtraces. Device enables only device side backtraces. For more information, see Stack Backtraces.
--tool | memcheck, racecheck | memcheck | Controls which CUDA-MEMCHECK tool is actively running.
--version (-V) | N/A | N/A | Prints the version of cuda-memcheck.

Table 3. Memcheck Tool Command line options

Option | Values | Default | Description
--check-device-heap | yes, no | yes on SM 2.0 and higher, no on SM 1.x | Enable checking of device heap allocations. This applies to both error checking and leak checking. This option only has effect on SM 2.0 and higher GPUs.
ponding to the saved host stack at the time the allocation was first made. Also presented is a summary of the total number of bytes leaked and the corresponding number of allocations.

In this example, the program created an allocation using cudaMalloc() and did not call cudaFree() to release it, leaking memory. Notice that CUDA-MEMCHECK still prints the errors it encountered while running the application.

cuda-memcheck --leak-check full memcheck_demo
========= CUDA-MEMCHECK
Mallocing memory
Running unaligned_kernel
Ran unaligned_kernel: no error
Sync: no error
Running out_of_bounds_kernel
Ran out_of_bounds_kernel: no error
Sync: no error
========= Invalid __global__ write of size 4
=========     at 0x00000060 in memcheck_demo.cu:6:unaligned_kernel(void)
=========     by thread (0,0,0) in block (0,0,0)
=========     Address 0x400100001 is misaligned
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/local/lib/libcuda.so (cuLaunchKernel + 0x3ae) [0xddbee]
=========     Host Frame:/usr/local/lib/libcudart.so.5.0 [0xcd27]
=========     Host Frame:/usr/local/lib/libcudart.so.5.0 (cudaLaunch + 0x1bb) [0x3778b]
=========     Host Frame:memcheck_demo (_Z10cudaLaunchIcE9cudaErrorPT_ + 0x18) [0x122c]
=========     Host Frame:memcheck_demo (__device_stub__Z16unaligned_kernelvv + 0x1d) [0x10a6]
=========     Host Frame:m
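A minimal illustration of the leak scenario described above (assumed names, not the manual's exact demo source): the allocation is never freed, and destroying the context via cudaDeviceReset() is what lets --leak-check full report it:

```cuda
#include <cstdio>

// devMem is allocated but never passed to cudaFree. Run the program as
// "cuda-memcheck --leak-check full ./app" to get a leak report with the
// leaked size and the host backtrace of the cudaMalloc call site.
int main(void) {
    int *devMem = NULL;
    cudaMalloc((void **) &devMem, 1024);   // 1024 bytes, reported as leaked
    cudaDeviceReset();   // destroys the CUDA context; leak check runs here
    return 0;
}
```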
port for symbol information in the host application. For more information, see Compilation Options. In CUDA 5, the host stack backtrace will show a maximum of 61 frames.

Backtraces are printed for most CUDA-MEMCHECK tool outputs, and the information generated varies depending on the type of output. The table below explains the kind of host and device backtrace seen under different conditions.

Table 6. CUDA-MEMCHECK Stack Backtrace Information

Output type | Host backtrace | Device backtrace
cudaMalloc allocation leak | Callsite of cudaMalloc | N/A
CUDA API error | Callsite of CUDA API call | N/A
MEMCHECK internal error | Callsite leading to internal error | N/A
Device heap allocation leak | N/A | N/A
Shared memory hazard | Kernel launch on host | N/A

5.3. Name Demangling

The CUDA-MEMCHECK suite now supports displaying mangled and demangled names for CUDA kernels and CUDA device functions. By default, tools display the fully demangled name, which contains the name of the kernel as well as its prototype information. In the simple demangle mode, the tools will only display the first part of the name. If demangling is disabled, tools will display the complete mangled name of the kernel.

5.4. Dynamic Parallelism

The CUDA-MEMCHECK tool suite supports dynamic parallelism. The memcheck tool supports precise error reporting of out of bounds and misaligned accesses on global, local, and shared memory accesses, as well as on g
rt API errors by default. For more information, see Command Line Options.

1.4. How to Get CUDA-MEMCHECK

CUDA-MEMCHECK is installed as part of the CUDA toolkit.

1.5. CUDA-MEMCHECK tools

Tools use the basic CUDA-MEMCHECK infrastructure to provide different checking mechanisms. Currently, the supported tools are:

‣ Memcheck: The memory access error and leak detection tool. See Memcheck Tool.
‣ Racecheck: The shared memory data access hazard detection tool. See Racecheck Tool.

Table 1. Supported Modes by CUDA-MEMCHECK tool

Tool | Standalone Mode | Integrated Mode

Chapter 2. USING CUDA-MEMCHECK

CUDA-MEMCHECK tools can be invoked by running the cuda-memcheck executable as follows:

cuda-memcheck [options] app_name [app options]

For a full list of options that can be specified to memcheck and their default values, see Command Line Options.

2.1. Command Line Options

Command line options can be specified to cuda-memcheck. With some exceptions, the options to memcheck are usually of the form --option value. The option list can be terminated by specifying --; all subsequent words on the command line are treated as the application being run and its arguments.

The table below describes the supported options in detail. The first column is the option name as passed to CUDA-MEMCHECK. Some opti
5.3. Name Demangling .......................................................... 19
5.4. Dynamic Parallelism ....................................................... 19
5.5. Error Actions ............................................................. 19
Chapter 6. CUDA-MEMCHECK Tool Examples ......................................... 21
6.1. Example Use of Memcheck ................................................... 21
6.1.1. memcheck_demo Output .................................................... 22
6.1.2. memcheck_demo Output with Memcheck (Release Build) ...................... 22
6.1.3. memcheck_demo Output with Memcheck (Debug Build) ........................ 24
6.1.4. Leak Checking in CUDA-MEMCHECK .......................................... 25
6.2. Integrated CUDA-MEMCHECK Example .......................................... 28
Appendix A. Memory Access Error Reporting ...................................... 29
Appendix B. Hardware Exception Reporting ....................................... 30
Appendix C. Known Issues ....................................................... 32

LIST OF TABLES

Table 1. Supported Modes by CUDA-MEMCHECK tool .................................. 3
Table 2. CUD
t 0x000079d8
by thread (0,0,0) in block (0,0,0)
Address 0x400aff920

We can examine this report line by line.

Malloc/Free error encountered : Double free

The first line indicates that this is a malloc/free error, and contains the type of error. This type can be:

‣ Double free: This indicates that the thread called free() on an allocation that had already been freed.
‣ Invalid pointer to free: This indicates that free() was called on a pointer that was not returned by malloc().
‣ Heap corruption: This indicates generalized heap corruption, or cases where the state of the heap was modified in a way that memcheck did not expect.

In this example, the error is due to calling free() on a pointer which had already been freed.

at 0x000079d8

The second line gives the PC on the GPU where the error was reported. This PC is usually inside of system code and is not interesting to the user; the device frame backtrace will contain the location in user code where the malloc()/free() call was made.

by thread (0,0,0) in block (0,0,0)

The third line contains the thread and block indices of the thread that caused this error. In this example, the thread has threadIdx = (0,0,0) and blockIdx = (0,0,0).

Address 0x400aff920

This line contains the value of the pointer passed to free() or returned by malloc().

4. Leak errors: Errors are reported for allocations created using cudaMalloc and for allocations on the device
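A double free of the kind dissected above can be produced by a kernel like this hypothetical one (not the manual's demo source); the memcheck tool reports the second free() along with the PC, the thread and block indices, and the pointer value:

```cuda
#include <cstdio>

// The same device-heap pointer is passed to free() twice; the memcheck
// tool reports "Malloc/Free error encountered : Double free" for the
// second call.
__global__ void double_free_kernel(void) {
    int *ptr = (int *) malloc(sizeof(int));
    if (ptr == NULL) return;
    free(ptr);
    free(ptr);   // second free on the same pointer: flagged by memcheck
}

int main(void) {
    double_free_kernel<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}
```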
thread error | This occurs when a thread accesses an illegal (out of bounds) global address.
CUDA_EXCEPTION_2: Lane User Stack Overflow | Precise | Per lane/thread error | This occurs when a thread exceeds its stack memory limit.
CUDA_EXCEPTION_3: Device Hardware Stack Overflow | Not precise | Global error on the GPU | This occurs when the application triggers a global hardware stack overflow. The main cause of this error is large amounts of divergence in the presence of function calls.
CUDA_EXCEPTION_4: Warp Illegal Instruction | Not precise | Warp error | This occurs when any thread within a warp has executed an illegal instruction.
CUDA_EXCEPTION_5: Warp Out-of-range Address | Not precise | Warp error | This occurs when any thread within a warp accesses an address that is outside the valid range of local or shared memory regions.
CUDA_EXCEPTION_6: Warp Misaligned Address | Not precise | Warp error | This occurs when any thread within a warp accesses an address in the local or shared memory segments that is not correctly aligned.
CUDA_EXCEPTION_7: Warp Invalid Address Space | Not precise | Warp error | This occurs when any thread within a warp executes an instruction that accesses a memory space not permitted for that instruction.
CUDA_EXCEPTION_8: Warp
un_unaligned();
    run_out_of_bounds();
    cudaDeviceReset();
    cudaFree(devMem);
    return 0;
}

This application is compiled for release builds as:

nvcc -gencode arch=compute_20,code=sm_20 -o memcheck_demo memcheck_demo.cu

6.1.1. memcheck_demo Output

When a CUDA application causes access violations, the kernel launch may terminate with an error code of unspecified launch failure, or a subsequent cudaThreadSynchronize call will fail with an error code of unspecified launch failure. This sample application causes two failures, but there is no way to detect where these kernels are causing the access violations, as illustrated in the following output:

memcheck_demo
Mallocing memory
Running unaligned_kernel
Ran unaligned_kernel: no error
Sync: unspecified launch failure
Running out_of_bounds_kernel
Ran out_of_bounds_kernel: unspecified launch failure
Sync: unspecified launch failure

6.1.2. memcheck_demo Output with Memcheck (Release Build)

In this case, since the application is built in release mode, the CUDA-MEMCHECK output contains only the kernel names from the application causing the access violation. Though the kernel name and error type are detected, there is no line number information on the failing kernel. Also included in the output are the host and device backtraces for the call sites where the functions were launched. In addition, CUDA API errors are reported, such as the invalid cudaFree call in the application.

cuda-memcheck memcheck_demo
========= CUDA-MEMCHECK
Mallocing memory
Running unaligned_kernel
Ran unaligned_kernel: no error
Sync: no error
Running out_of_bounds_kernel
Ran out_of_bounds_kernel: no error
Sync: no error
========= Invalid __global__ write of size 4
=========     at 0x00000028 in unaligned_kernel(void)
=========     by thread (0,0,0) in block (0,0,0)
=========     Address 0x400100001 is misaligned
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/local/lib/libcuda.so (cuLaunchKernel + 0x3ae) [0xddbee]
=========     Host Frame:/usr/local/lib/libcudart.so.5.0 [0xcd27]
=========     Host Frame:/usr/local/lib/libcudart.so.5.0 (cudaLaunch + 0x1bb) [0x3778b]
=========     Host Frame:memcheck_demo [0xdfc]
=========     Host Frame:memcheck_demo [0xc76]
=========     Host Frame:memcheck_demo [0xc81]
=========     Host Frame:memcheck_demo [0xb03]
=========     Host Frame:memcheck_demo [0xc27]
=========     Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xfd) [0x1eb1d]
=========     Host Frame:memcheck_demo [0x9b9]
=========
========= Invalid __global__ write of size 4
=========     at 0x00000010 in out_of_bounds_kernel(void)
=========     by thread (0,0,0) in block (0,0,0)
=========     Address 0xffffffff87654320 is out of bounds
=========     Saved host backtrace up to driver entry p
xfec]
=========     Host Frame:memcheck_demo (main + 0x2d) [0x105c]
=========     Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xfd) [0x1eb1d]
=========     Host Frame:memcheck_demo [0xde9]
=========
========= Program hit error 17 on CUDA API call to cudaFree
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/local/lib/libcuda.so [0x28 850]
=========     Host Frame:/usr/local/lib/libcudart.so.5.0 (cudaFree + 0x20d) [0x364ed]
=========     Host Frame:memcheck_demo (main + 0x3e) [0x106d]
=========     Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xfd) [0x1eb1d]
=========     Host Frame:memcheck_demo [0xde9]
=========
========= Leaked 1024 bytes at 0x400200000
=========     Saved host backtrace up to driver entry point at cudaMalloc time
=========     Host Frame:/usr/local/lib/libcuda.so (cuMemAlloc_v2 + 0x236) [0xe9746]
=========     Host Frame:/usr/local/lib/libcudart.so.5.0 [0x26dd7]
=========     Host Frame:/usr/local/lib/libcudart.so.5.0 (cudaMalloc + 0x17a) [0x36e6a]
=========     Host Frame:memcheck_demo (main + 0x23) [0x1052]
=========     Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xfd) [0x1eb1d]
=========     Host Frame:memcheck_demo [0xde9]
=========
========= LEAK SUMMARY: 1024 bytes leaked in 1 allocations
========= ERROR SUMMARY: 3 errors