A GPU-Based Path Tracer: March 2017

Saturday, March 25, 2017

Assignment 3: Accelerating with BVH

Drawing fancy stuff with ray tracing can be very slow even with the modern GPUs. In the previous assignment, my ray tracer rendered a horse, contains approximately 11k triangles, in 3.54 seconds! The rendering time is O(k) for k number of triangles in the scene without any accelerating structure. 3.54 seconds might seem reasonable for the time being but what happens if you want to render the dragon with 871k triangles and sending lots of rays beside the primary rays? My ray tracer rendered the dragon model in 97seconds, and just sent primary and shadow rays. So, an accelerating structure is inevitable to render scenes in a reasonable amount of time.

This assignment required us to implement a BVH tree to speed things up. Before delving into coding, I found some papers implementing BVHs efficiently. Let's list some of them:

On fast Construction of SAH-based Bounding Volume Hierarchies: This paper is one of the most popular one. It tells how to build BVHs with surface area heuristics(SAH) in a binned fashion. I implemented this paper and got nearly %10-15 faster rendering times compared to BVH with a median splitting.

Spatial Splits in Bounding Volume Hierarchies: This paper offers a way to contruct BVH trees which gives the best ray tracing performance. It splits the overlapping primitives and creates high-quality trees. However, construction time is way more slower than other BVH techniques. It is definitely worth trying. As soon as I find a free time, I will implement this paper and share the results.

Parallel Spatial Splits in Bounding Volume Hierarchies: This one presents a fast way to build SBVHs on modern CPUs by means of parallelism and vectorization.

Fast BVH Construction on GPUs: This paper introduces linear bounding volume hierarchy (LBVH) algorithm which makes use of morton code. While construction times are really fast, ray tracing performance is not that promising. However, it offers you a trade-off when you have animated stuff in your scene and have to rebuild the BVH structure every frame.

Maximizing Parallelism in the Construction of BVHs, Octrees, and k-d Trees: Another paper for LBVHs.

Fast Parallel Construction of High-Quality Bounding Volume Hierarchies: This paper has another aspect to build BVHs. The algorithm first constructs an LBVH tree and then restructures the tree nodes to minimize SAH cost.

Understanding the Efficiency of Ray Traversal on GPUs: This paper presents several ways to get your ray tracer run much faster on GPUs.

Review and Comparative Study of Ray Traversal Algorithms on a Modern GPU Architecture: Finally, this paper compares several accelerating structures by building them on CPU and ray tracing on GPU. You might want to look at this before you decide which accelerating structure to go with.

Now, let's have a look at the images rendered for this assignment and respective rendering times.

output of killeroo.xml
kernel execution time without BVH: 8860 milliseconds
kernel execution time with BVH: 13.8 milliseconds
speed-up: 642x

output of dragon.xml
kernel execution time without BVH: 97096 milliseconds
kernel execution time with BVH: 21.5 milliseconds
speed-up: 4516x

Lessons learned:

Every BVH node is 64-byte including bounding box of the left and right child, node index of left and right child, start and end indices to the triangle soup of the mesh. Since memory is the greatest bottleneck in CUDA, care must be taken dealing with the ordering of nodes. I realized that trees constructed in depth-first order gives better results, %5 performance improvement, than the trees constructed in breadth-first order.
In the previous post, I mentioned about the divergence. Well, I saw how it affected my ray tracer when traversing the nodes of the tree. In my first tree traversing implementation that suffers from divergence, I was checking if a node is a leaf or an inner node. If it is an inner node, I put it onto the traversal stack and if it is a leaf node, I do the ray-triangle(s) intersection. However, while some of the threads in warp do the ray-triangle intersection, which is computationally expensive, the threads that do not hit a leaf node would wait until the end of these intersection operations. The reason is that threads in a warp execute in a lock-step. That is, they execute the same instruction at every instruction cycle. However, if they see a branch, the threads that take the branch execute their path while others wait and vice versa. Eventually, I came up with another implementation where leaf nodes are put onto another stack. After tree traversal is finished, every thread does the intersection test by iterating through the stack which contains leaf nodes. I am sure that there are better ways to solve this, however, this gave me a %15 performance improvement.

Thursday, March 16, 2017

Assignment 2: Instancing and Smooth Triangles

Hello again. We are required implement smooth triangles and object instances for this assignment. Smooth triangles make use of three normal vectors, per-vertex, instead of using just a single normal vector for the whole triangle. By means of barycentric coordinates computed for the intersection point and these normal vectors, we can have better visuals. The outputs show this clearly.

Another issue was to implement instancing. Instead of having different memory locations for every instance of mesh, I preferred to use an instance class such that every single instance uses the same vertices but with different transformation matrices. Briefly, we do not transform the vertices but we apply the inverse transformation to the ray. This is more practical in general and when it comes to spheres, it eases our work.

Here are the outputs:

output of horse.xml
kernel execution time: 3.54 seconds

output of horse_instanced.xml
kernel execution time: 10.48 seconds

output of simple_transform.xml
kernel execution time: 2.92 milliseconds

output of spheres_transform.xml
kernel execution time: 3.32 milliseconds

Lessons learned:

I used -use_fast_math in the command line of the nvcc compiler. This command makes nvcc to optimize some of the functions and arithmetic operations by using some intrinsics which approximates the results. This command speeded up my ray tracer up to 1.25x. You can find the details of it in the related section of CUDA C Programming Guide.
Occupancy is one of the most important concept when you are dealing with CUDA. It basically states the ratio of number of active warps(a warp contains 32 threads) to the number of possible active warps(device limit). However, %100 occupancy does not always mean that it gives the best results. For example, in my case, the number of registers per multiprocessor was the limiting case. If I set it to 32(65536 registers per multiprocessor/2048 active threads per multiprocessor) registers per thread, occupancy increases to %100 but threads use a limited amount of registers. If I let the nvcc decides, it uses 48 registers per thread which gives a poor occupancy. I manually set it to 36 and got the best results. However, it might always change.
CUDA featured graphics cards do not have branch predictors. Using branches in your code is not the end of the life. However, if you use it unnecessarily or you put large amount of code(or function that contains) to different execution paths in the same warp, you will suffer from divergence. It is explained neatly in the CUDA C Best Practices Guide
Beside CUDA related issues, there was one thing that made me busy: instancing implementation for the spheres. After we transform our rays to object space by applying the inverse transformation of the object, we calculate the distance parameter in that space. Our professor told us that we can compare and use this parameter with other parameters calculated in other object spaces or in world space. This worked very well for triangles but I was suspicious if it is working for spheres or not. However, here, Matt Phar verifies that it works for the spheres as well. The only difference is that I was using geometric approach for ray-sphere intersection but they use the analytic solution. After some drawing and calculations on the paper, I understood that it is not possible to use distance parameter calculated in the object space in anywhere else if you are using geometric approach. Finally, I adopted the analytic solution for the ray-sphere intersection. Keep in mind that you should not normalize the direction vector of the transformed ray. If you do that, you cannot make use of what I've just explained above. The reason for this is well-explained in the last link.

Saturday, March 11, 2017

Assignment 1: First Blood

Hi! I will implement this ray tracer, I may use "ray tracer" and "path tracer" interchangeably, throughout this semester for the "Advanced Ray Tracing" course taught in Computer Engineering, Middle East Technical University. I built my first ray tracer, again for another course in Saarland University, on CPU. Since then, I always wanted to implement one to utilize the massive parallel performance of GPUs. Thanks to this course, I found the motivation to start to implement an embarrassingly parallel ray tracer.

There were two options to get going: OpenCL or CUDA. Although OpenCL is a cross-platform solution for the project, I decided to go with CUDA since it seems that CUDA has a better support. I will run every program on my Geforce GTX 960M.

I will be letting you know the execution time to produce every image. I only consider kernel execution time on GPU which does basically the whole thing. Therefore, memory allocations, copying host memory to GPU's global memory, etc. will not be considered in calculated time.

I designed two different modes to run the ray tracer. First one is the photo mode. It basically takes a photo, saves it and returns. On the other hand, video mode presents an interactive real-time ray tracer. It can produce up to 500 frames per second for simple scenes. However, since I haven't implemented an accelerating structure yet, the ray tracer doesn't give a real-time performance in complex scenes.

Enough talking, let's show the outputs and respective execution times.

output of simple.xml
kernel execution time: 2.7 milliseconds

output of simple_shading.xml
kernel execution time: 3.42 milliseconds

output of bunny.xml
kernel execution time: 351 milliseconds

Let's list some lessons learned from this assignment:

Choose the number of threads per block wisely. Generally 8x8 or 16x16 gives the best performance.
Be careful when you copy the values from host memory to device memory if you are copying an instance of a class which involves virtual methods. When you try to copy the the instances of this class, vtable pointer is also copied along the member variables. The problem is that this vtable pointer points to memory locations reside in host memory. However, when you try to reach to these addresses in device side, the behaviour is undefined.
"Kernel execution time limit" took my whole day. It could have been really easy to solve it however I did not implement a structure to check runtime cuda call errors. If your default display graphics card and the card on which your cuda code executes are the same, you are likely to have this issue. If the GPU cannot finish executing the kernel in 5-6 seconds,(I do not know the exact limit) it simply ignores the kernel execution and stops. The workaround to this problem can be found here.
As I mentioned in the previous one, write a macro or whatever you wish to check errors related to cuda calls.