This assignment required us to implement a BVH tree to speed things up. Before delving into coding, I found some papers implementing BVHs efficiently. Let's list some of them:

- On fast Construction of SAH-based Bounding Volume Hierarchies: This paper is one of the most popular one. It tells how to build BVHs with surface area heuristics(SAH) in a binned fashion. I implemented this paper and got nearly %10-15 faster rendering times compared to BVH with a median splitting.

- Spatial Splits in Bounding Volume Hierarchies: This paper offers a way to contruct BVH trees which gives the best ray tracing performance. It splits the overlapping primitives and creates high-quality trees. However, construction time is way more slower than other BVH techniques. It is definitely worth trying. As soon as I find a free time, I will implement this paper and share the results.

- Parallel Spatial Splits in Bounding Volume Hierarchies: This one presents a fast way to build SBVHs on modern CPUs by means of parallelism and vectorization.

- Fast BVH Construction on GPUs: This paper introduces linear bounding volume hierarchy (LBVH) algorithm which makes use of morton code. While construction times are really fast, ray tracing performance is not that promising. However, it offers you a trade-off when you have animated stuff in your scene and have to rebuild the BVH structure every frame.

- Maximizing Parallelism in the Construction of BVHs, Octrees, and k-d Trees: Another paper for LBVHs.

- Fast Parallel Construction of High-Quality Bounding Volume Hierarchies: This paper has another aspect to build BVHs. The algorithm first constructs an LBVH tree and then restructures the tree nodes to minimize SAH cost.

- Understanding the Efficiency of Ray Traversal on GPUs: This paper presents several ways to get your ray tracer run much faster on GPUs.

- Review and Comparative Study of Ray Traversal Algorithms on a Modern GPU Architecture: Finally, this paper compares several accelerating structures by building them on CPU and ray tracing on GPU. You might want to look at this before you decide which accelerating structure to go with.

Now, let's have a look at the images rendered for this assignment and respective rendering times.

output of killeroo.xml kernel execution time without BVH: 8860 milliseconds kernel execution time with BVH: 13.8 milliseconds speed-up: 642x |

output of dragon.xml kernel execution time without BVH: 97096 milliseconds kernel execution time with BVH: 21.5 milliseconds speed-up: 4516x |

Lessons learned:

- Every BVH node is 64-byte including bounding box of the left and right child, node index of left and right child, start and end indices to the triangle soup of the mesh. Since memory is the greatest bottleneck in CUDA, care must be taken dealing with the ordering of nodes. I realized that trees constructed in depth-first order gives better results, %5 performance improvement, than the trees constructed in breadth-first order.
- In the previous post, I mentioned about the divergence. Well, I saw how it affected my ray tracer when traversing the nodes of the tree. In my first tree traversing implementation that suffers from divergence, I was checking if a node is a leaf or an inner node. If it is an inner node, I put it onto the traversal stack and if it is a leaf node, I do the ray-triangle(s) intersection. However, while some of the threads in warp do the ray-triangle intersection, which is computationally expensive, the threads that do not hit a leaf node would wait until the end of these intersection operations. The reason is that threads in a warp execute in a lock-step. That is, they execute the same instruction at every instruction cycle. However, if they see a branch, the threads that take the branch execute their path while others wait and vice versa. Eventually, I came up with another implementation where leaf nodes are put onto another stack. After tree traversal is finished, every thread does the intersection test by iterating through the stack which contains leaf nodes. I am sure that there are better ways to solve this, however, this gave me a %15 performance improvement.