This book was basically written by NVIDIA; it is mainly about CUDA.

12/19/11

Hello world Example — Allocate host and device memory

```cuda
int *h_a, *d_a;
int *h_b, *d_b;
cutilSafeCall(cudaMallocHost((void**)&h_a, memsize));
cutilSafeCall(cudaMallocHost((void**)&h_b, memsize));
cutilSafeCall(cudaMalloc((void**)&d_a, memsize));
cutilSafeCall(cudaMalloc((void**)&d_b, memsize));
```

Hello world Example — Host code

```cuda
// Kernel parameters
dim3 threads(numthreads / blocksize, 1);
dim3 blocks(blocksize, 1);

// Copy the parameters to GPU global memory
cutilSafeCall(cudaMemcpy(d_a, h_a, memsize, cudaMemcpyHostToDevice));
cutilSafeCall(cudaMemcpy(d_b, h_b, memsize, cudaMemcpyHostToDevice));

// Invoke the kernel
helloworld<<<blocks, threads>>>(d_a, d_b);

// Copy the results back to the CPU
cutilSafeCall(cudaMemcpy(h_a, d_a, memsize, cudaMemcpyDeviceToHost));
cutilSafeCall(cudaMemcpy(h_b, d_b, memsize, cudaMemcpyDeviceToHost));
```

Hello world Example — Kernel code

```cuda
__global__ void helloworld(int *a, int *b)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    a[idx] = blockIdx.x;
    b[idx] = threadIdx.x;
}
```

To Try CUDA Programming
- SSH to 138.47.102.165
- Set environment variables in .bashrc in your home directory:

```
export PATH=$PATH:/usr/local/cuda/bin
export LD_LIBRARY_PATH=/usr/local/cuda/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
```

- Copy the SDK from /home/students/NVIDIA_GPU_Computing_SDK
- Compile the following directories:
  - NVIDIA_GPU_Computing_SDK/shared/
  - NVIDIA_GPU_Computing_SDK/C/common/
- The sample codes are in NVIDIA_GPU_Computing_SDK/C/src/

Demo
- Hello world: print out block and thread ids
- Vector add: C = A + B

CUDA Language Concept
- CUDA programming model
- CUDA memory model

Some terminologies
- Device = GPU = set of stream multiprocessors
- Stream Multiprocessor (SM) = set of processors + shared memory
- Kernel = GPU program
- Grid = array of thread blocks that execute a kernel
- Thread block = group of SIMD threads that execute a kernel and can communicate via shared memory

CUDA Programming Model
Parallel code (a kernel) is launched and executed on a device by many threads.
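As a concrete instance of this model (one kernel executed by many threads), the Hello world fragments above can be assembled into one compilable program. This is a minimal sketch, not the slides' exact code: the CUDA_CHECK macro stands in for the SDK's long-deprecated cutilSafeCall wrapper, and the sizes (numthreads = 16, blocksize = 4) are illustrative assumptions.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Simple stand-in for cutilSafeCall: abort main() on any CUDA error.
#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err = (call);                                         \
        if (err != cudaSuccess) {                                         \
            fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err)); \
            return 1;                                                     \
        }                                                                 \
    } while (0)

__global__ void helloworld(int *a, int *b)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    a[idx] = blockIdx.x;   // record which block this thread ran in
    b[idx] = threadIdx.x;  // record the thread's index within its block
}

int main()
{
    const int numthreads = 16, blocksize = 4;   // illustrative sizes
    const size_t memsize = numthreads * sizeof(int);

    int *h_a, *h_b, *d_a, *d_b;
    CUDA_CHECK(cudaMallocHost((void**)&h_a, memsize));  // pinned host memory
    CUDA_CHECK(cudaMallocHost((void**)&h_b, memsize));
    CUDA_CHECK(cudaMalloc((void**)&d_a, memsize));      // device global memory
    CUDA_CHECK(cudaMalloc((void**)&d_b, memsize));

    // 4 blocks of 4 threads each cover all 16 elements.
    helloworld<<<numthreads / blocksize, blocksize>>>(d_a, d_b);

    CUDA_CHECK(cudaMemcpy(h_a, d_a, memsize, cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaMemcpy(h_b, d_b, memsize, cudaMemcpyDeviceToHost));

    for (int i = 0; i < numthreads; i++)
        printf("idx %2d: block %d thread %d\n", i, h_a[i], h_b[i]);

    cudaFree(d_a); cudaFree(d_b);
    cudaFreeHost(h_a); cudaFreeHost(h_b);
    return 0;
}
```

Compile with `nvcc hello.cu -o hello`; each of the 16 indices prints the block and thread that wrote it.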
Threads are grouped into thread blocks, and parallel code is written for a single thread.

```cuda
// Kernel definition
__global__ void vecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}
```

Thread hierarchy
- Threads launched for a parallel section are partitioned into thread blocks
- A thread block is a group of threads that can:
  - Synchronize their execution
  - Communicate via a low-latency shared memory
- Grid = all thread blocks for a given launch

[Figure: a grid of thread blocks Block(0,0) through Block(2,1), with Block(1,1) expanded into its threads Thread(0,0) through Thread(3,2)]

IDs and dimensions
- Threads: 3D IDs, unique within a block; two threads from two different blocks cannot cooperate
- Blocks: 2D and 3D IDs (depending on the hardware), unique within a grid
- Dimensions are set at launch time and can be unique for each section
- Built-in variables: threadIdx, blockIdx, blockDim, gridDim

[Figure: Kernel 1 launched as Grid 1, a 3×2 array of blocks; Kernel 2 launched as Grid 2, with Block(1,1) expanded into its threads]

Example: Increment Array Elements (NVIDIA)
- Increment an N-element vector a by a scalar b
- Let's assume N = 16 and blockDim = 4, so there are 4 blocks
- int idx = blockDim.x * blockIdx.x + threadIdx.x

| blockIdx.x | blockDim.x | threadIdx.x | idx            |
|------------|------------|-------------|----------------|
| 0          | 4          | 0, 1, 2, 3  | 0, 1, 2, 3     |
| 1          | 4          | 0, 1, 2, 3  | 4, 5, 6, 7     |
| 2          | 4          | 0, 1, 2, 3  | 8, 9, 10, 11   |
| 3          | 4          | 0, 1, 2, 3  | 12, 13, 14, 15 |
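The index computation in the increment example can be verified on the device with a tiny program. This is a minimal sketch; the kernel name write_idx and the printing loop are my own, and error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread writes its computed global index into out[idx].
// With N = 16 and blocks of 4 threads, the indices are exactly 0..15,
// matching the per-block breakdown in the increment example.
__global__ void write_idx(int *out)
{
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    out[idx] = idx;
}

int main()
{
    const int N = 16, blocksize = 4;
    int *d_out, h_out[N];
    cudaMalloc((void**)&d_out, N * sizeof(int));
    write_idx<<<N / blocksize, blocksize>>>(d_out);   // 4 blocks of 4 threads
    cudaMemcpy(h_out, d_out, N * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < N; i++)
        printf("%d ", h_out[i]);   // 0 1 2 ... 15, one index per thread
    printf("\n");
    cudaFree(d_out);
    return 0;
}
```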
Example: Increment Array Elements — CPU program vs. CUDA program

```cuda
// CPU program
void increment_cpu(float *a, float b, int N)
{
    for (int idx = 0; idx < N; idx++)
        a[idx] = a[idx] + b;
}

void main()
{
    .....
    increment_cpu(a, b, 16);
}
```

```cuda
// CUDA program
__global__ void increment_gpu(float *a, float b, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)
        a[idx] = a[idx] + b;
}

void main()
{
    .....
    dim3 dimBlock(blocksize);
    dim3 dimGrid(ceil(N / (float)blocksize));
    increment_gpu<<<dimGrid, dimBlock>>>(a, b, 16);
}
```

© NVIDIA Corporation 2007

CUDA Memory Model
Each thread can:
- R/W per-thread registers
- R/W per-thread local memory
- R/W per-block shared memory
- R/W per-grid global memory
- Read only per-grid constant memory
- Read only per-grid texture memory

The host can R/W the global, constant, and texture memories.

[Figure: memory model — the device grid with per-block shared memory and per-thread registers; global, constant, and texture memories accessible from the host]
[Figure: hardware view — multiprocessors with registers and shared memory; device DRAM holding local, global, constant, and texture memory, with constant and texture caches]
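To make the per-block shared memory in this model concrete, here is a minimal sketch of a kernel that stages data through shared memory before writing it back to global memory. The reversal kernel, its names, and BLOCKSIZE = 8 are my own illustration, not from the slides.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define BLOCKSIZE 8

// Reverses BLOCKSIZE elements using per-block shared memory:
// each thread copies one element from global memory into the shared
// buffer, the whole block synchronizes, then each thread writes the
// mirrored element back out to global memory.
__global__ void reverse_block(int *d_data)
{
    __shared__ int buf[BLOCKSIZE];   // R/W per-block shared memory
    int t = threadIdx.x;             // held in a per-thread register
    buf[t] = d_data[t];              // read from per-grid global memory
    __syncthreads();                 // every thread in the block waits here
    d_data[t] = buf[BLOCKSIZE - 1 - t];
}

int main()
{
    int h[BLOCKSIZE], *d;
    for (int i = 0; i < BLOCKSIZE; i++) h[i] = i;
    cudaMalloc((void**)&d, sizeof(h));
    cudaMemcpy(d, h, sizeof(h), cudaMemcpyHostToDevice);
    reverse_block<<<1, BLOCKSIZE>>>(d);   // one block of 8 threads
    cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
    for (int i = 0; i < BLOCKSIZE; i++)
        printf("%d ", h[i]);              // the input 0..7, reversed
    printf("\n");
    cudaFree(d);
    return 0;
}
```

The __syncthreads() barrier is what makes the staging safe: without it, a thread could read buf[BLOCKSIZE - 1 - t] before the thread responsible for that slot has written it. This also illustrates why threads in different blocks cannot cooperate this way: shared memory and __syncthreads() are per-block only.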