We present an algorithm for constructing kd-trees on GPUs. This algorithm achieves real-time performance by exploiting the GPU's streaming architecture at all stages of kd-tree construction. Unlike previous parallel kd-tree algorithms, our method builds tree nodes completely in BFS (breadth-first search) order.

Algorithm 1 Kd-Tree Construction
  procedure BUILDTREE(triangles: list)
  begin
    // initialization stage
    nodelist ← new list
    activelist ← new list
    smalllist ← new list
    nextlist ← new list
    Create rootnode
    activelist.add(rootnode)
    for each input triangle t in parallel
      Compute AABB for triangle t
    // large node stage
    while not activelist empty
      nodelist.append(activelist)
      nextlist.clear()
      PROCESSLARGENODES(activelist, smalllist, nextlist)
      Swap nextlist and activelist
    // small node stage
    PREPROCESSSMALLNODES(smalllist)
    activelist ← smalllist
    while not activelist empty
      nodelist.append(activelist)
      nextlist.clear()
      PROCESSSMALLNODES(activelist, nextlist)
      Swap nextlist and activelist
    // kd-tree output stage
    PREORDERTRAVERSAL(nodelist)
  end

Algorithm 2 Large Node Stage
  procedure PROCESSLARGENODES(
      in activelist: list;
      out smalllist, nextlist: list)
  begin
    // group triangles into chunks
    for each node i in activelist in parallel
      Group all triangles in node i into fixed size chunks, store chunks in chunklist
    // compute per-node bounding box
    for each chunk k in chunklist in parallel
      Compute the bounding box of all triangles in k, using standard reduction
    Perform segmented reduction on the per-chunk reduction results to compute per-node bounding boxes
    // split large nodes
    for each node i in activelist in parallel
      for each side j of node i
        if i contains more than Ce empty space on side j then
          Cut off i's empty space on side j
      Split node i at the spatial median of the longest axis
      for each created child node ch
        nextlist.add(ch)
    // sort and clip triangles to child nodes
    for each chunk k in chunklist in parallel
      i ← k.node()
      for each triangle t in k in parallel
        if t is contained in both children of i then
          t0 ← t; t1 ← t
          Sort t0 and t1 into the two child nodes
          Clip t0 and t1 to their respective owner nodes
        else
          Sort t into the child node containing it
    // count triangle numbers for child nodes
    for each chunk k in chunklist in parallel
      i ← k.node()
      Count triangle numbers in i's children, using reduction
    Perform segmented reduction on the per-chunk results to compute per-child-node triangle numbers
    // small node filtering
    for each node ch in nextlist in parallel
      if ch is a small node then
        smalllist.add(ch)
        nextlist.delete(ch)
  end

Figure 2: Two cases of large node split. (a) Cut off empty space; (b) spatial median split.

The SAH cost function is defined as

  SAH(x) = Cts + CL(x)AL(x)/A + CR(x)AR(x)/A,

where Cts is the constant cost of traversing the node itself, CL(x) is the cost of the left child given a split position x, and CR(x) is the cost of the right child given the same split. AL(x) and AR(x) are the surface areas of the left and right child respectively, and A is the surface area of the node. Note that CL(x) and CR(x) can only be evaluated after the entire sub-tree has been built. Instead of seeking a globally optimal solution, existing algorithms use a locally greedy approximation by assuming the children are leaf nodes. In this case CL(x) and CR(x) equal the number of elements contained in the left and right child respectively.

In the initialization stage, global memory is allocated for tree construction and the root node is created. Additionally, a streaming step is performed to compute the AABB (axis-aligned bounding box) for each input triangle. In our current implementation, the user-specified threshold between large and small nodes is set as T = 64.
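Under the greedy approximation, evaluating a candidate split needs only the child triangle counts and surface areas. A minimal sketch (the function name and the unit traversal cost Cts = 1 are our own choices, not from the paper):

```python
def sah_cost(n_left, n_right, area_left, area_right, area, c_ts=1.0):
    """Greedy SAH approximation: children are assumed to become leaves,
    so C_L and C_R are simply the child triangle counts."""
    return c_ts + (n_left * area_left + n_right * area_right) / area

# pick the cheapest of two hypothetical candidate splits of a node with area 2.0
costs = [sah_cost(2, 2, 1.0, 1.0, 2.0),   # balanced split
         sah_cost(4, 0, 1.5, 0.5, 2.0)]   # all triangles on one side
best = min(range(len(costs)), key=costs.__getitem__)
```

The balanced split wins here because both terms of the weighted count are small.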
Algorithm Overview  The algorithm takes a triangle soup as input and follows the construction pipeline shown in Algorithm 1. After an initialization step, the algorithm builds the tree in a BFS manner, for both large nodes and small nodes. Finally, all nodes of the tree are reorganized and stored. The pipeline consists of a set of stream processing steps together with minimal coordination work. The streaming steps are done on the GPU while the coordination work is done on the CPU at negligible cost.

3.1 Large Node Stage

As mentioned, the SAH evaluation in the conventional greedy optimization algorithm assumes that the current split produces two leaf nodes. For large nodes, this assumption is almost always untrue, and the resulting estimation is far from accurate. Our splitting scheme for large nodes is instead a combination of spatial median splitting and "empty space maximizing", which is highly effective for the upper levels of the tree, as noted in [Havran 2001]. Specifically, if the empty space contained in the current node is larger than a predefined ratio Ce along one axis, the empty space is cut off in the next split; otherwise, the split plane is chosen at the spatial median of the node's longest axis (see Fig. 2). Currently, we take Ce = 25%.

Algorithm 3 GPU Segmented Reduction
  procedure GPUSEGREDUCE(
      in data, owner: list; op: reduction operator;
      out result: list)
  begin
    result ← new list
    Fill result with op's identity element
    // assume there are n elements
    for d = 0 to log2 n − 1
      for each i = 0 to (n − 1)/2^(d+1) in parallel
        w0 ← owner[2^(d+1) i]
        w1 ← owner[2^(d+1) i + 2^d]
        if w0 ≠ w1 then
          result[w1] ← op(result[w1], data[2^(d+1) i + 2^d])
        else
          data[2^(d+1) i] ← op(data[2^(d+1) i], data[2^(d+1) i + 2^d])
  end

Note that the chunk data structure is critical for optimal performance. Within each chunk, we only need to perform unsegmented reduction on all triangles' AABBs, greatly reducing the element number in the subsequent segmented reduction. Although it is possible to compute the node bounding boxes by performing segmented reduction on all input triangles' AABBs directly, this is inefficient because large segmented reductions are about three times slower than large unsegmented reductions [Sengupta et al. 2007].

In the third step, with the node bounding boxes computed, large nodes are split in parallel using the splitting scheme described earlier. Note that we repeatedly split a node using empty space splitting until a spatial median split is reached. This allows us to reuse the bounding box and avoid unnecessary computations after empty space splitting.

In the fourth step, triangles are sorted and clipped into child nodes. Triangle sorting is essentially list splitting. For each chunk, the triangles in the chunk are first checked to generate a vector of boolean values indicating whether each triangle is in a child node or not. Then the triangles are divided into two groups, with all the triangles marked true on the left side of the output vector and all the triangles marked false on the right side. This can be done easily using the split operation described in [Sengupta et al. 2007]. For those triangles contained in both child nodes, another pass is needed to clip them into the nodes.

  Operator   Identity value   Usage
  min        +∞               compute bounding box
  max        −∞               compute bounding box
  +          0                count triangle number

Table 1: Reduction operators and their usage in Algorithm 2.

In the final step, we count the triangle numbers for all child nodes using segmented reduction, in a way similar to the bounding box computation.
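Algorithm 3's multi-pass pairing can be emulated sequentially to check its semantics. The sketch below is our own; the final fold of the surviving front element into its segment is implicit in the listing and is made explicit here:

```python
def gpu_seg_reduce(data, owner, op, identity, num_segments):
    """Sequential emulation of the multi-pass segmented reduction of
    Algorithm 3. Elements of one segment are contiguous in `data`, and
    owner[i] is the segment index of data[i]. Each pass pairs elements a
    stride apart: pairs with one owner are combined in place, pairs that
    straddle a segment boundary flush the right element into `result`."""
    data = list(data)
    n = len(data)
    result = [identity] * num_segments
    if n == 0:
        return result
    stride = 1
    while stride < n:
        for i in range(0, n - stride, 2 * stride):   # one "parallel" pass
            j = i + stride
            if owner[i] != owner[j]:
                result[owner[j]] = op(result[owner[j]], data[j])
            else:
                data[i] = op(data[i], data[j])
        stride *= 2
    # fold the surviving front element into its own segment
    result[owner[0]] = op(result[owner[0]], data[0])
    return result
```

Running it with + reproduces the per-segment triangle counts; with min/max and the identities of Table 1 it yields per-segment bounding box extents.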
The reduction operator used here is +. If the triangle number of a node is less than the threshold T, the node is added to smalllist and deleted from nextlist.

Note that, to apply this splitting scheme, a tight bounding box of all triangles contained in the node has to be computed.

The large node processing procedure, PROCESSLARGENODES, is elaborated in Algorithm 2. This procedure takes activelist as input, and updates smalllist and nextlist as output. Note that we also maintain a triangle-node association list for each node list. The triangle-node association list stores the triangle indices contained in the node list, sorted by node index. Each node in the node list records the index of its first triangle in the triangle-node association list, the number of triangles it contains, the scene space it occupies, and the pointers to its child nodes.

Now we walk through the major steps of PROCESSLARGENODES in Algorithm 2. The first step of the procedure is to group all triangles in each node into fixed-sized chunks. Currently we set the chunk size to N = 256. A large fraction of the subsequent computations are parallelized over all triangles in these chunks.

In the second step, the bounding box of all triangles in each node is computed. This is done by first computing the bounding box of all triangles' AABBs in each chunk, using the reduction algorithm described in Algorithm 4 of [Popov et al. 2007], and then computing the bounding boxes of all nodes by performing segmented reduction [Gropp et al. 1994] on the sequence of all chunk reduction results. Segmented reduction performs reduction on arbitrary segments of an input sequence; the result is a sequence in which each element holds the reduction result of one segment.

Our GPU algorithm for segmented reduction is described in Algorithm 3. In the input list data, all data elements belonging to the same segment are located contiguously. In another input list owner, owner[i] indicates the segment index of data[i]. The reduction operator op is associated with an identity value, as listed in Table 1. The algorithm takes a multi-pass approach. Each thread takes two elements. If the two elements have the same owner, they are replaced by their operation result.

3.2 Small Node Stage

Compared to the large node stage, the small node stage is relatively simple. First, the computation is parallelized over nodes rather than triangles. The workload among small nodes is naturally balanced because the triangle numbers of small nodes do not vary significantly (from 0 to T). Second, unlike in the large node stage, we choose not to clip triangles when splitting small nodes. Although clipping triangles to owner nodes reduces false positives of the triangle-in-node test and always reduces the SAH cost, clipping may also cause undesirable excessive splits, because the SAH does not take memory costs into account. While clipping is effective for large nodes by preventing false positives from accumulating over future splits, for small nodes our experiments indicate that clipping rarely improves ray tracing performance. Thus we do not clip triangles for small nodes, and the splitting plane candidates are restricted to those determined by the faces of the AABBs of the triangles contained in the initial small nodes.

As shown in Algorithm 4, the small node stage consists of two procedures, PREPROCESSSMALLNODES and PROCESSSMALLNODES. The first procedure collects all split candidates. It also generates the triangle sets contained on both sides of each splitting plane candidate, with a single pass over the triangles in a node. The second procedure, PROCESSSMALLNODES, splits small nodes. Processed in parallel for each node i, the procedure first gets its triangle set and its uppermost ancestor smallroot (also a small node). Then the SAH costs of all split candidates located inside the node are computed. Finally the node is split using the optimal split plane with minimal cost, and triangles are sorted into the child nodes.
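Because the paper stores these triangle sets as bit masks of the smallroot, the per-candidate SAH evaluation in PROCESSSMALLNODES reduces to bitwise ANDs and bit counts. A sequential sketch (helper names and inputs are hypothetical; each set is assumed to fit in one Python integer):

```python
def popcount(mask):
    # stand-in for a parallel bit counting routine
    return bin(mask).count("1")

def best_split(s, candidates, area, area_lr, c_ts=1.0):
    """SAH evaluation over bit-mask triangle sets: s is the node's
    triangle set as a bit mask of its smallroot; each candidate carries
    precomputed left/right masks, and area_lr[j] the child areas."""
    sah0 = popcount(s)               # cost of keeping the node a leaf
    best, best_cost = None, sah0
    for j, (left_mask, right_mask) in enumerate(candidates):
        cl = popcount(s & left_mask)     # triangles in the left child
        cr = popcount(s & right_mask)    # triangles in the right child
        al, ar = area_lr[j]
        cost = c_ts + (cl * al + cr * ar) / area
        if cost < best_cost:
            best, best_cost = j, cost
    return best, best_cost           # best is None -> make a leaf

s = 0b1111                            # a node holding 4 smallroot triangles
candidates = [(0b0011, 0b1100), (0b0001, 0b1110)]
best, cost = best_split(s, candidates, area=2.0, area_lr=[(1.0, 1.0), (0.5, 1.5)])
```

The intersection-by-AND is what makes candidate evaluation O(1) per set operation, independent of triangle order.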
Otherwise, one element is accumulated into result and the other is retained.

Instead of storing the triangle sets in the triangle-node association lists as is done in the large node stage, we now store the triangle set of a small node as a bit mask of its smallroot, as shown in Fig. 3. Note that the triangle sets of each split candidate j, j.left and j.right, are also stored as bit masks.

Figure 3: Storing triangle sets as bit masks of the small root. Node A is split into node B and node C as shown in (a). The triangles of B and C are subsets of their small root A's triangles; they are stored as bit masks as shown in (b).

Algorithm 4 Small Node Stage
  procedure PREPROCESSSMALLNODES(smalllist: list)
  begin
    for each node i in smalllist in parallel
      i.splitList ← list of all split candidates in i
      for each split candidate j in i in parallel
        /* "left" represents the smaller coordinate */
        j.left ← triangle set on the left of j
        j.right ← triangle set on the right of j
  end

  procedure PROCESSSMALLNODES(
      in activelist: list;
      out nextlist: list)
  begin
    for each node i in activelist in parallel
      // compute SAH and determine the split plane
      s ← i.triangleSet
      r ← i.smallRoot
      A ← area of node i
      SAH0 ← ‖s‖
      for each j where j ∈ r.splitList and j.triangle ∈ s
        CL ← ‖s ∩ j.left‖
        CR ← ‖s ∩ j.right‖
        AL ← area of left child after split j
        AR ← area of right child after split j
        SAHj ← Cts + (CL·AL + CR·AR)/A
      p ← the split candidate that yields minimal SAH
      // split small nodes
      if SAHp ≥ SAH0 then
        Mark i as leaf node
      else
        Split i using p, add new nodes to nextlist
        Sort triangles to new nodes
  end

Algorithm 5 Preorder Traversal
  procedure PREORDERTRAVERSAL(nodelist: list)
  begin
    for each tree level l of nodelist, from bottom up
      UPPASS(l)
    Allocate tree using the root node's size
    for each tree level l of nodelist, from top down
      DOWNPASS(l)
  end

  procedure UPPASS(activelist: list)
  begin
    for each node i in activelist in parallel
      if i is not a leaf then
        i.size ← i.left.size + i.right.size + 1
      else
        i.size ← i.triangleCount + 1
  end

  procedure DOWNPASS(activelist: list)
  begin
    for each node i in activelist in parallel
      if i is not a leaf then
        i.left.address ← i.address + 1
        i.right.address ← i.address + 1 + i.left.size
      Store node i in final format to i.address
  end
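Algorithm 5's two passes can be sketched sequentially on a toy tree (the node record is hypothetical; the real builder keeps these fields in GPU memory). A leaf's subtree size is triangleCount + 1, an internal node adds one slot for itself, and the down-pass turns sizes into preorder starting addresses:

```python
class Node:
    def __init__(self, left=None, right=None, triangle_count=0):
        self.left, self.right = left, right
        self.triangle_count = triangle_count
        self.size = 0       # subtree size in the serialized layout
        self.address = 0    # preorder starting address

def is_leaf(n):
    return n.left is None

def preorder_layout(levels):
    """Sequential sketch of Algorithm 5: an up-pass accumulates subtree
    sizes bottom-up, then a down-pass assigns preorder addresses."""
    for level in reversed(levels):          # UPPASS, bottom-up
        for i in level:
            if is_leaf(i):
                i.size = i.triangle_count + 1
            else:
                i.size = i.left.size + i.right.size + 1
    levels[0][0].address = 0
    for level in levels:                    # DOWNPASS, top-down
        for i in level:
            if not is_leaf(i):
                i.left.address = i.address + 1
                i.right.address = i.address + 1 + i.left.size
    return levels[0][0].size                # total memory to allocate

a = Node(triangle_count=2)
b = Node(triangle_count=3)
root = Node(left=a, right=b)
total = preorder_layout([[root], [a, b]])
```

The per-level grouping mirrors the BFS node lists that Algorithm 1 already produces.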
With this bit mask representation, triangle sorting and SAH evaluation for any split candidate can be done efficiently using bitwise operations. As shown in Algorithm 4, the bit mask of the left child is computed as the bitwise AND of the bit mask of the current node s and the bit mask of the left side of the split candidate j, which is precomputed in PREPROCESSSMALLNODES. Then a parallel bit counting routine [Manku 2002] is performed on the resulting bit mask to obtain the number of triangles in the left child.

The bit mask representation allows us to compute the optimal split plane in O(n) time and sort triangles in O(1) time. An alternative method for computing the optimal splitting plane in O(n) is to sort all split candidates in a preprocess. Then the cost functions of all split candidates and the optimal splitting plane can be computed in a single pass over the sorted data, at a cost of O(n). However, since the sorted order cannot be represented as a bit mask, triangle sorting can only be done at a cost of O(n).

3.3 Kd-Tree Output Stage

As described in Section 4, our GPU ray tracer is stack-based and requires the kd-tree's final layout to be a preorder traversal of the nodes, for optimal cache performance.

We compute the preorder traversal using two parallel BFS traversals (see Algorithm 5). The first pass traverses the tree bottom-up to compute the required memory size for each subtree. The second pass traverses the tree top-down to compute the starting address in the traversal for each subtree, and distributes node information to the corresponding addresses to produce the final tree. This is analogous to the parallel scan in [Sengupta et al. 2007]. Note that, in procedure PREORDERTRAVERSAL, we need to collect the nodes located at the same tree level. Fortunately, this information is already available in each while-loop in Algorithm 1.

After the preorder traversal, each node in the resulting node list records the number and indices of the triangles it contains, its splitting plane, and the links to its children.

3.4 Implementation Details

We implemented the above kd-tree builder using NVIDIA's CUDA framework [NVIDIA 2007]. CUDA provides a general-purpose C language interface for GPU programming. It also exposes some important hardware features which are useful for data-parallel computations; for example, it allows arbitrary gather and scatter memory access from GPU programs. Our GPU implementation makes heavy use of these features.

In all the algorithm listings above, the parallel primitives (e.g., segmented reduction) and the code fragments marked by "in parallel" are GPU code; the rest is CPU code. We also need to specify the number of thread blocks and threads per block for the parallel primitives and the code fragments marked by "in parallel". In our current implementation, we use 256 threads for each block. The block number is computed by dividing the total number of parallel threads by the number of threads per block.

Figure 4: Test scenes for kd-tree construction and ray tracing. (a) Toys, 11K triangles, 1 light; (b) Museum, 27K triangles, 2 lights, 2 bounces; (c) Robots, 71K triangles, 3 lights, 1 bounce; (d) Kitchen, 111K triangles, 6 lights; (e) Fairy Forest, 178K triangles, 2 lights; (f) Dragon, 252K triangles, 1 light.

  Scene      Off-line CPU builder             Our GPU builder
             Ttree    Ttrace   SAH cost       Ttree    Ttrace   SAH cost
  Fig.4(a)   0.085s   0.022s   79.0           0.012s   0.018s   67.9
  Fig.4(b)   0.108s   0.109s   76.0           0.017s   0.108s   38.3
  Fig.4(c)   0.487s   0.165s   68.6           0.039s   0.157s   59.7
  Fig.4(d)   0.559s   0.26s    49.6           0.053s   0.207s   77.8
  Fig.4(e)   1.226s   0.08s    74.4           0.077s   0.078s   94.6
  Fig.4(f)   1.354s   0.027s   124.2          0.093s   0.025s   193.9

Table 2: Comparing kd-tree construction time Ttree, ray tracing time Ttrace and SAH cost between an off-line CPU builder and our GPU builder. All rendering times are for 1024x1024 images.

Figure 5: The tree construction time decreases quickly as the number of GPU processors increases, before reaching a plateau.

  #procs   Fig.4(a)  Fig.4(b)  Fig.4(c)  Fig.4(d)  Fig.4(e)  Fig.4(f)
  16       0.037s    0.057s    0.197s    0.260s    0.463s    0.564s
  32       0.022s    0.034s    0.107s    0.139s    0.242s    0.292s
  48       0.018s    0.026s    0.077s    0.098s    0.169s    0.202s
  64       0.016s    -         0.063s    0.079s    0.138s    -
  80       0.015s    0.020s    0.055s    0.068s    0.113s    0.132s
  96       0.014s    0.019s    0.049s    0.060s    0.100s    -
  112      0.013s    0.018s    0.046s    -         0.091s    0.105s
  128      0.012s    0.017s    0.039s    0.053s    0.077s    0.093s
  speedup  3.08      3.35      5.05      4.90      6.01      6.06

Table 3: Scalability of our kd-tree construction algorithm on a GeForce 8800 ULTRA graphics card. The bottom row shows the speedup going from 16 to 128 processors. Note that our algorithm scales better with large scenes. However, the scalability is still sublinear, mainly because the total running time contains a constant portion due to the overhead of the CUDA API and driver.

3.5 Results and Discussion

The described algorithm has been tested on an Intel Xeon 3.7GHz CPU with an NVIDIA GeForce 8800 ULTRA (768MB) graphics card. Parameters (e.g., T and N) used during tree construction are intentionally kept the same for all scenes.

During kd-tree construction, we store all data as dynamic lists in linear device memory allocated via CUDA. List size is doubled whenever more memory is required.
This allows us to avoid the high overhead of CUDA memory management after an initial run, at the cost of more memory consumption. For structures with many fields, such as nodes and triangles, we use structure of arrays (SoA) instead of array of structures (AoS) for optimal GPU cache performance.

From its description, the reader may have noticed that our algorithm frequently calls certain parallel primitives such as reduce and scan. Many of these primitives have been efficiently implemented and exposed in CUDPP [Harris et al. 2007]. Most conditional program flows in the pseudocode are handled using list splitting, which is also a standard GPU primitive with an optimized implementation [Sengupta et al. 2007].

We compare our GPU algorithm with an off-line CPU algorithm which always uses the greedy SAH cost to calculate optimal split planes and clips triangles into child nodes [Wald and Havran 2006]. Table 2 summarizes the comparison results for several publicly available scenes, shown in Fig. 4. As shown, our kd-tree construction algorithm is 6~15 times faster for all scenes. The quality of the trees is assessed in two ways. First, we compute the SAH costs. Second, we evaluate the practical effect of tree quality on render time by using the constructed trees in a ray tracer, as described in Section 4. As shown in the table, our algorithm generates lower SAH costs for Toys, Museum and Robots, but higher SAH costs for Kitchen, Fairy Forest and Dragon. In all cases, our trees always offer better rendering performance, which attests to the high quality of our trees in practical applications.
Note that the SAH cost is the expected cost for a ray to traverse the entire tree, whereas actual kd-tree traversal terminates at the first node of intersection. Therefore there is no strict correlation between SAH costs and actual ray tracing times. The SAH cost is only one way to measure the quality of kd-trees; the most important metric is how well the resulting tree accelerates ray traversals, which is the ultimate goal of an SAH tree construction strategy.

The conditional branches in Algorithm 3 will be serialized and incur a performance penalty, but the chunk structure used to perform most computations in the per-chunk standard reductions of Algorithm 2 avoids these conditional program flows. Compared to the per-chunk standard reductions, the segmented reduction in Algorithm 3 does not consume any significant processing time, and its performance issues can thus be safely ignored.

  Scene      [Wald07]   [Shevtsov07]   Our method
  Fig.4(a)   10.5 fps   23.1 fps       32.0 fps
  Fig.4(b)   -          -              8.00 fps
  Fig.4(c)   -          -              4.96 fps
  Fig.4(d)   -          -              4.84 fps
  Fig.4(e)   2.0 fps    5.84 fps       6.40 fps
  Fig.4(f)   n/a        -              8.85 fps

Table 4: Performance comparison results for four dynamic scenes. All images are rendered at resolution 1024x1024. [Wald07] times are from [Wald et al. 2007] on an AMD Opteron 2.6GHz CPU. Multi-core times are from [Shevtsov et al. 2007] on a Dual Intel Core2 Duo 3.0GHz (4 cores).

Figure 6: GPU ray tracing of a dynamic subdivision surface. The scene consists of 47K triangles. The armadillo model is directly evaluated on the GPU through subdivision and displacement mapping from a coarse control mesh. We achieve 22 fps for 800x600 images.

Our kd-tree construction algorithm also scales well with the number of GPU processors. The running time contains a scalable portion and a small non-scalable portion due to the overhead of the CUDA API and driver. Theoretically, the running time is linear with respect to the reciprocal of the number of processors.
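This model, t(p) = t0 + w/p with a constant overhead t0 and scalable work w, can be checked with an ordinary least-squares fit in 1/p (the helper is our own; the data below is synthetic, not taken from Table 3):

```python
def fit_overhead_model(procs, times):
    """Least-squares fit of t(p) = t0 + w/p: t0 is the non-scalable
    overhead, w the perfectly scalable work."""
    xs = [1.0 / p for p in procs]
    n = len(xs)
    mx = sum(xs) / n
    mt = sum(times) / n
    w = sum((x - mx) * (t - mt) for x, t in zip(xs, times)) / \
        sum((x - mx) ** 2 for x in xs)
    t0 = mt - w * mx
    return t0, w

procs = [16, 32, 48, 64, 80, 96, 112, 128]
synthetic = [0.02 + 1.6 / p for p in procs]   # data shaped exactly like the model
t0, w = fit_overhead_model(procs, synthetic)
```

A non-zero fitted t0 is exactly what makes the measured speedups sublinear.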
As shown in Table 3 and Fig. 5, we ran the algorithm on a GeForce 8800 ULTRA graphics card with 16, 32, 48, 64, 80, 96, 112, and 128 processors respectively. The NVStrap driver in RivaTuner [Nicolaychuk 2002] is used to disable processing units by adjusting hardware masks.

Although our technique is capable of constructing high quality kd-trees in real time, it has its limitations. For small scenes with less than 5K triangles, CUDA's API overhead becomes a major bottleneck; in this case, it is more efficient to switch to a complete CPU method. Also, our method consumes much more memory than a CPU method. This is mainly due to the use of doubling lists and the extra bookkeeping for BFS order construction. Our system supports scenes with up to 600K triangles on the GeForce 8800 Ultra (768MB) graphics card. For the six tested scenes, the peak memory in our build is around 8MB, 18MB, 50MB, 90MB, 123MB and 178MB respectively. This problem, however, can be reduced with a better memory management scheme. For example, we currently keep many temporary data structures in memory at all stages to avoid costly CUDA API calls to free these temporary data.

In order to handle reflection/refraction, our ray tracer performs the following multiple passes after building a kd-tree for the scene:

1. Spawn and trace eye rays;
2. Generate a list of hits on specular and refractive surfaces by performing a list compaction [Harris et al. 2007] on the eye ray hit points;
3. Spawn and trace reflective and refractive rays;
4. Repeat Step 2 and Step 3 if there are more bounces to handle;
5. Spawn and trace shadow rays;
6. Compute shading.

After the shading is computed, each ray's contribution to the final image is sent to an OpenGL pixel buffer object (PBO). The PBO is then accumulated into the final image using alpha blending.
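The list compaction of Step 2 can be sketched with an exclusive prefix sum over the keep flags (sequential sketch; on the GPU the scan and the scatter both run in parallel):

```python
def compact(items, keep):
    """Scan-based stream compaction: an exclusive prefix sum over the
    keep flags gives each surviving element its output address."""
    addr, total = [], 0
    for flag in keep:
        addr.append(total)           # exclusive scan of the flags
        total += 1 if flag else 0
    out = [None] * total
    for i, item in enumerate(items):
        if keep[i]:
            out[addr[i]] = item      # scatter survivors to their slots
    return out
```

Applied to eye-ray hit records flagged as specular or refractive, this yields the dense list of hits that spawns the next bounce.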
If we implement a set of efficient CUDA memory allocation/free routines, we will be able to free temporary data and reduce memory consumption considerably. Other techniques for reducing memory are certainly possible and are to be investigated in future work. The memory consumption issue is also alleviated by the rapid advancement of graphics hardware: NVIDIA recently released the Quadro FX 5600, which supports CUDA and has 1.5GB of memory.

Experimental Results  We tested our GPU ray tracer using the dynamic scenes shown in Fig. 4. Table 4 compares our frame rates with those reported in two recent works. One is an algorithm based on bounding volume hierarchies (BVHs) [Wald et al. 2007], and the other is the multi-core CPU algorithm using kd-trees [Shevtsov et al. 2007]. The performance takes into account both the tree (or BVH) construction and the rendering time. It can be seen that our algorithm runs interactively with shadows and multi-bounce reflection/refraction, and outperforms the other two algorithms. These results suggest that for dynamic scenes, GPU ray tracing accelerated by our kd-trees provides a competitive alternative to CPU ray tracing on multi-core CPUs. Note that we do not claim that our GPU ray tracer is faster than all CPU ray tracers. Indeed, implementing the fastest CPU ray tracer is like chasing a moving target, because various optimizations could be used for higher performance, some optimizations are hardware dependent, and better performance can be achieved by adding more CPU cores. For example, [Wald 2007] reported 13~21 frames per second for the exploding dragon scene (Fig. 4(f)) on a 2.6GHz Clovertown system with 8 cores.

4 GPU Ray Tracing

We have incorporated our kd-tree builder into a GPU ray tracer for arbitrary dynamic scenes. For each frame, the ray tracer first builds a kd-tree from scratch. For each ray to be traced, the ray tracer walks through the kd-tree until it reaches leaf nodes and the associated triangles, in front-to-back order.

While existing GPU ray tracers [Foley and Sugerman 2005; Horn et al. 2007; Popov et al. 2007] adopt a stackless scheme for kd-tree traversal, they require additional information to be precomputed and stored during tree construction, and extra computation during tree traversal. To avoid such overhead, we chose to implement a conventional stack-based scheme on the GPU. As pointed out in [Horn et al. 2007], when a ray passes through both sides of a splitting plane, the "far" subtree is pushed onto the stack and the "near" subtree is traversed first. For this reason a stack-based scheme requires a local stack for each thread. Fortunately, this can be implemented efficiently in CUDA by allocating a fixed-sized array in thread-local memory. Although the kd-tree depth is unbounded in theory, we found that a stack depth of 50 is enough for all test scenes in this paper.
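The stack-based scheme can be sketched as follows (node types and the two-leaf test tree are our own; on the GPU the stack is the fixed-size thread-local array just described). At each inner node the ray's parametric range [t0, t1] is split at the plane, the near child is descended first, and the far child is pushed with its clipped range, so leaves come out front to back:

```python
import math
from collections import namedtuple

Leaf = namedtuple("Leaf", "triangles")
Inner = namedtuple("Inner", "axis pos left right")  # split plane x[axis] = pos

def traverse(root, origin, direction, t_min, t_max):
    """Yield leaf triangle lists in front-to-back order along the ray."""
    stack = [(root, t_min, t_max)]
    while stack:
        node, t0, t1 = stack.pop()
        while isinstance(node, Inner):
            o, d = origin[node.axis], direction[node.axis]
            if o < node.pos or (o == node.pos and d <= 0):
                near, far = node.left, node.right
            else:
                near, far = node.right, node.left
            t = (node.pos - o) / d if d != 0 else math.inf
            if t >= t1 or t < 0:
                node = near                  # far child is never reached
            elif t <= t0:
                node = far                   # ray enters beyond the plane
            else:
                stack.append((far, t, t1))   # defer the far child
                node, t1 = near, t
        yield node.triangles

tree = Inner(0, 0.5, Leaf(["tri_L"]), Leaf(["tri_R"]))
order = list(traverse(tree, (0.1, 0.0, 0.0), (1.0, 0.0, 0.0), 0.0, 10.0))
```

A ray starting left of the plane and pointing right visits the left leaf first, matching the near-then-far policy.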
Note that for the Toys and Fairy Forest scenes, our frame rates are higher than those of the 4-core CPU algorithm [Shevtsov et al. 2007]. Both scenes actually do not reveal our method's advantage in tree quality, due to the lack of divergent secondary rays from reflection/refraction. However, this already demonstrates the potential of ray tracing dynamic scenes on GPUs.

A unique feature of our ray tracer is that it can efficiently handle dynamic geometries that are directly evaluated on the GPU, such as skinned meshes [Wang et al. 2007] and subdivision surfaces [Shiue et al. 2005]. The armadillo in Fig. 6 is such an example. The input geometry is a sequence of coarse control meshes provided by the authors of [Zhou et al. 2007]. Two levels of Loop subdivision and displacement mapping are performed on the GPU to generate the detailed meshes. The output of GPU subdivision and displacement mapping is immediately sent to our GPU kd-tree builder and then ray traced directly, without copying back to the CPU. Please see the accompanying video for live demos.

5 GPU Photon Mapping

In this section we first show how to adapt our kd-tree builder for photon mapping. Then we describe how to perform k-nearest-neighbor (KNN) search using kd-trees on the GPU. Finally we show how to use the kd-tree builder and KNN search to render caustics, and present some experimental results.

5.1 Kd-Tree for Photon Mapping

Algorithm 1 can be used to build photon kd-trees after several modifications. First, we use VVH [Wald et al. 2004] instead of SAH to

Algorithm 6 KNN Search
  function KNNSEARCH(in q: point)
  begin
    rmin ← 0
    rmax ← r0
    hist ← new array[0..nhist − 1]
    for i = 1 to niter
      Δr ← (rmax − rmin)/nhist
      Set all elements in hist to zero
      for each photon p, ‖p − q‖ < rmax, via range search
        Increment hist[⌊(‖p − q‖ − rmin)/Δr⌋]
      Find j such that the cumulative count of hist[0..j] first reaches k
      (rmin, rmax) ← (rmin + jΔr, rmin + (j + 1)Δr)
    rk ← rmax
    return all photons p, ‖p − q‖ < rk
  end
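Algorithm 6's iterative histogram refinement can be emulated sequentially over precomputed photon distances (a sketch under the assumption that the k-th nearest photon lies within the initial radius r0; on the GPU the candidates come from kd-tree range search):

```python
def knn_radius(dists, k, r0, n_iter=6, n_hist=16):
    """Estimate the distance r_k to the k-th nearest photon by repeatedly
    histogramming distances over [r_min, r_max) and zooming into the bin
    that contains the k-th nearest distance."""
    r_min, r_max = 0.0, r0
    below = 0   # photons already known to lie closer than r_min
    for _ in range(n_iter):
        dr = (r_max - r_min) / n_hist
        hist = [0] * n_hist
        for d in dists:
            if r_min <= d < r_max:
                hist[min(int((d - r_min) / dr), n_hist - 1)] += 1
        cum, j = below, 0
        for j in range(n_hist):          # first bin whose cumulative count
            if cum + hist[j] >= k:       # reaches k holds the k-th photon
                break
            cum += hist[j]
        r_min, r_max = r_min + j * dr, r_min + (j + 1) * dr
        below = cum
    return r_max    # r_k estimate; gather all photons closer than this

r = knn_radius([0.9, 0.1, 0.5, 0.3, 0.7], k=3, r0=1.0)
```

Each iteration shrinks the interval by a factor of n_hist, so a few iterations pin r_k down to high precision without sorting the photons.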
