The recognized authoritative work on OpenCL, written by the core designers of OpenCL themselves. It not only offers a comprehensive and insightful interpretation of the OpenCL specification and programming model, but also uses extensive case studies and code to demonstrate the principles, methods, workflows, and best practices for writing parallel programs and implementing parallel algorithms in OpenCL, as well as how to tune OpenCL performance and how to probe and adjust the hardware. The book is organized in two parts. Part I (Chapters 1-13) starts with the core ideas of OpenCL and the basics of writing OpenCL programs, then gives a thorough, systematic reading of the otherwise dry OpenCL specification, aiming to help readers understand the specification and its programming model fully and correctly. Part II (Chapters 14-22) presents a series of classic case studies, such as an image histogram and a Sobel edge-detection filter.

Contents

3.3.2.5 Texture Memory
3.4 Maximize Instruction Throughput
3.4.1 Arithmetic Instructions
3.4.2 Control Flow Instructions
3.4.3 Synchronization Instruction
Appendix A. CUDA-Enabled GPUs
Appendix B. Mathematical Functions Accuracy
B.1 Standard Functions
B.1.1 Single-Precision Floating-Point Functions
B.1.2 Double-Precision Floating-Point Functions
B.2 Native Functions
Appendix C. Compute Capabilities
C.1 Features and Technical Specifications
C.2 Floating-Point Standard
C.3 Compute Capability 1.x
C.3.1 Architecture
C.3.2 Global Memory
C.3.2.1 Devices of Compute Capability 1.0 and 1.1
C.3.2.2 Devices of Compute Capability 1.2 and 1.3
C.3.3 Shared Memory
C.3.3.1 32-Bit Strided Access
C.3.3.2 32-Bit Broadcast Access
C.3.3.3 8-Bit and 16-Bit Access
C.3.3.4 Larger Than 32-Bit Access
C.4 Compute Capability 2.x
C.4.1 Architecture
C.4.2 Global Memory
C.4.3 Shared Memory
C.4.3.1 32-Bit Strided Access
C.4.3.2 Larger Than 32-Bit Access
C.4.4 Constant Memory
C.5 Compute Capability 3.0
C.5.1 Architecture
C.5.2 Global Memory
C.5.3 Shared Memory

List of Figures

Figure 1-1. Floating-Point Operations per Second and Memory Bandwidth for the CPU and GPU
Figure 1-2. The GPU Devotes More Transistors to Data Processing
Figure 1-3. CUDA Is Designed to Support Various Languages and Application Programming Interfaces
Figure 1-4. Automatic Scalability
Figure 2-1. Grid of Thread Blocks
Figure 2-2. Matrix Multiplication without Shared Memory
Figure 2-3. Matrix Multiplication with Shared Memory

Chapter 1. Introduction

1.1 From Graphics Processing to General-Purpose Parallel Computing

Driven by the insatiable market demand for real-time, high-definition 3D graphics, the programmable Graphics Processor Unit or GPU has evolved into a highly parallel, multithreaded, manycore processor with tremendous computational horsepower and very high memory bandwidth, as illustrated by Figure 1-1.
Figure 1-1. Floating-Point Operations per Second and Memory Bandwidth for the CPU and GPU
[Two charts plot theoretical GFLOP/s (single and double precision) and theoretical GB/s from roughly 2001 to 2010 for NVIDIA GPUs (GeForce FX 5800 through GeForce GTX 580, Tesla C1060 and C2050) against Intel CPUs (Pentium 4 through Westmere).]

The reason behind the discrepancy in floating-point capability between the CPU and the GPU is that the GPU is specialized for compute-intensive, highly parallel computation - exactly what graphics rendering is about - and is therefore designed such that more transistors are devoted to data processing rather than data caching and flow control, as schematically illustrated by Figure 1-2.

Figure 1-2. The GPU Devotes More Transistors to Data Processing
[The diagram contrasts a CPU, which spends much of its die area on control logic and cache, with a GPU, which dedicates most of its area to ALUs; both are backed by DRAM.]

More specifically, the GPU is especially well-suited to address problems that can be expressed as data-parallel computations - the same program is executed on many data elements in parallel - with high arithmetic intensity - the ratio of arithmetic operations to memory operations. Because the same program is executed for each data element, there is a lower requirement for sophisticated flow control; and because it is executed on many data elements and has high arithmetic intensity, the memory access latency can be hidden with calculations instead of big data caches.

Data-parallel processing maps data elements to parallel processing threads. Many applications that process large data sets can use a data-parallel programming model to speed up the computations. In 3D rendering, large sets of pixels and vertices are mapped to parallel threads. Similarly, image and media processing applications such as post-processing of rendered images, video encoding and decoding, image scaling, stereo vision, and pattern recognition can map image blocks and pixels to parallel processing threads. In fact, many algorithms outside the field of image rendering and processing are accelerated by data-parallel processing, from general signal processing or physics simulation to computational finance or computational biology.
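To make this mapping concrete, the following sketch shows what "one thread per data element" looks like in a CUDA C kernel. It is an illustrative example, not code from this guide; the kernel name saxpy and all parameter names are hypothetical.

__global__ void saxpy(int n, float a, const float *x, float *y)
{
    // Each thread computes one element: the same program runs on
    // many data elements in parallel.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                    // guard threads past the end of the array
        y[i] = a * x[i] + y[i];   // arithmetic dominates control flow
}

// Host-side launch (one thread per element, 256 threads per block):
// saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);

Because every thread executes the same instruction stream on its own element, the hardware can keep thousands of such threads in flight and overlap the memory accesses of some threads with the arithmetic of others, which is the latency-hiding behavior described above.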
1.2 CUDA™: A General-Purpose Parallel Computing Architecture

In November 2006, NVIDIA introduced CUDA™, a general-purpose parallel computing architecture - with a new parallel programming model and instruction set architecture - that leverages the parallel compute engine in NVIDIA GPUs to solve many complex computational problems in a more efficient way than on a CPU.

As illustrated by Figure 1-3, there are several languages and application programming interfaces that can be used to program the CUDA architecture.

Figure 1-3. CUDA Is Designed to Support Various Languages and Application Programming Interfaces
[The diagram shows GPU computing applications built on libraries and middleware (cuFFT, CUDPP, PhysX, mental ray) and on OpenCL, DirectCompute, Fortran, Java, and Python interfaces, all running on NVIDIA GPUs with the CUDA parallel computing architecture: the Tesla architecture (compute capability 1.x: GeForce 8, 9, and 200 Series, Quadro FX, Plex, and NVS Series, Tesla Series) and the Fermi architecture (compute capability 2.x: GeForce 400 and 500 Series, Quadro Fermi Series, Tesla 20 Series).]

1.3 A Scalable Programming Model

The advent of multicore CPUs and manycore GPUs means that mainstream processor chips are now parallel systems. Furthermore, their parallelism continues to scale with Moore's law. The challenge is to develop application software that transparently scales its parallelism to leverage the increasing number of processor cores, much as 3D graphics applications transparently scale their parallelism to manycore GPUs with widely varying numbers of cores.

The CUDA parallel programming model is designed to overcome this challenge with three key abstractions: a hierarchy of thread groups, a hierarchy of shared memories, and barrier synchronization.

These abstractions provide fine-grained data parallelism and thread parallelism, nested within coarse-grained data parallelism and task parallelism. They guide the programmer to partition the problem into coarse sub-problems that can be solved independently in parallel by blocks of threads, and each sub-problem into finer pieces that can be solved cooperatively in parallel by all threads within the block.

This decomposition preserves language expressivity by allowing threads to cooperate when solving each sub-problem, and at the same time enables automatic scalability. Indeed, each block of threads can be scheduled on any of the available processor cores, in any order, concurrently or sequentially, so that a compiled OpenCL program can execute on any number of processor cores, as illustrated by Figure 1-4.
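As a sketch of this two-level decomposition (an illustrative example under assumed names, not code from this guide), the kernel below sums an array: each thread block independently reduces one slice of the input (the coarse sub-problem), while the threads within a block cooperate through shared memory and barrier synchronization (the finer pieces). It assumes the block size is a power of two.

__global__ void blockSum(const float *in, float *blockResults, int n)
{
    extern __shared__ float partial[];      // shared memory: one slice per block
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;

    // Fine-grained data parallelism: every thread loads one element.
    partial[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                        // barrier: the slice is fully loaded

    // The block's threads cooperatively fold the slice in half each step.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            partial[tid] += partial[tid + stride];
        __syncthreads();                    // barrier before the next step
    }

    if (tid == 0)                           // coarse result: one sum per block
        blockResults[blockIdx.x] = partial[0];
}

// Launch with dynamic shared memory sized to the block, e.g.:
// blockSum<<<numBlocks, 256, 256 * sizeof(float)>>>(d_in, d_out, n);

Because no block depends on any other, the runtime is free to schedule the blocks on however many cores the device has, in any order; this is precisely the automatic scalability described above.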