Saturday, March 11, 2017

Assignment 1: First Blood

Hi! I will be implementing this ray tracer (I may use "ray tracer" and "path tracer" interchangeably) throughout this semester for the "Advanced Ray Tracing" course taught in Computer Engineering at Middle East Technical University. I built my first ray tracer on the CPU, for another course at Saarland University. Since then, I have always wanted to implement one that utilizes the massively parallel performance of GPUs. Thanks to this course, I found the motivation to start implementing an embarrassingly parallel ray tracer.

There were two options to get going: OpenCL or CUDA. Although OpenCL would be a cross-platform solution for the project, I decided to go with CUDA since it seems to have better support. I will run every program on my GeForce GTX 960M.

I will report the execution time needed to produce every image. I only count the kernel execution time on the GPU, which is where basically all of the rendering happens. Therefore, memory allocations, copies from host memory to the GPU's global memory, etc. are not included in the reported times.
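
For readers who want to reproduce this kind of measurement, here is a minimal sketch, assuming CUDA events are used to time the kernel alone. The kernel `renderKernel`, the helper `timeRenderKernel`, and the 16x16 launch configuration are placeholders made up for illustration; they are not the actual code of this ray tracer.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the ray tracing kernel: writes one value per pixel.
__global__ void renderKernel(float* image, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;  // threads outside the image do nothing
    image[y * width + x] = 0.5f;
}

// Returns the kernel-only execution time in milliseconds, or -1 if the
// device buffer cannot be allocated (for example, when no GPU is present).
float timeRenderKernel(int width, int height) {
    float* dImage = nullptr;
    // The allocation happens before the start event, so it is not timed.
    if (cudaMalloc(&dImage, sizeof(float) * width * height) != cudaSuccess) {
        return -1.0f;
    }

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);

    cudaEventRecord(start);                          // marker just before the kernel
    renderKernel<<<grid, block>>>(dImage, width, height);
    cudaEventRecord(stop);                           // marker just after the kernel
    cudaEventSynchronize(stop);                      // wait until the kernel has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);          // elapsed GPU time between the markers

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dImage);
    return ms;
}
```

Calling `timeRenderKernel(800, 800)` from `main` and printing the result gives a number comparable in spirit to the times reported in this post, since allocation and cleanup fall outside the two event markers.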

I designed two different modes to run the ray tracer. The first one is photo mode: it basically takes a photo, saves it, and returns. Video mode, on the other hand, presents an interactive real-time ray tracer. It can produce up to 500 frames per second for simple scenes. However, since I haven't implemented an acceleration structure yet, the ray tracer doesn't deliver real-time performance in complex scenes.

Enough talking; let's look at the outputs and their respective execution times.

output of simple.xml
kernel execution time: 2.7 milliseconds

output of simple_shading.xml
kernel execution time: 3.42 milliseconds

output of bunny.xml
kernel execution time: 351 milliseconds

Let's list some lessons learned from this assignment:
  1. Choose the number of threads per block wisely. Generally, 8x8 or 16x16 threads per block gives the best performance.
  2. Be careful when copying values from host memory to device memory if you are copying an instance of a class that has virtual methods. When you copy an instance of such a class, the vtable pointer is copied along with the member variables. The problem is that this vtable pointer points to memory locations that reside in host memory, so when you try to access these addresses on the device side, the behaviour is undefined.
  3. "Kernel execution time limit" took my whole day. It would have been really easy to solve, but I had not implemented any mechanism to check for errors from CUDA runtime calls. If the graphics card that drives your display is also the card that executes your CUDA code, you are likely to have this issue. If the GPU cannot finish executing the kernel within 5-6 seconds (I do not know the exact limit), it simply abandons the kernel execution and stops. The workaround to this problem can be found here.
  4. As mentioned in the previous item, write a macro, or whatever you prefer, to check for errors from CUDA calls.
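
To make lessons 1, 3 and 4 more concrete, here is a minimal sketch of an error-checking macro combined with a 16x16 launch. The names `CUDA_CHECK`, `divUp`, `shadeKernel` and `renderOnce` are hypothetical and chosen only for this example; the actual ray tracer may be organized differently.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro: prints the file, line and error string,
// then exits, whenever a CUDA runtime call does not return cudaSuccess.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",              \
                    __FILE__, __LINE__, cudaGetErrorString(err_));    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Ceiling division, so the grid still covers images whose size is not
// a multiple of the block size.
__host__ __device__ inline int divUp(int a, int b) { return (a + b - 1) / b; }

// Placeholder kernel: one thread per pixel.
__global__ void shadeKernel(unsigned char* pixels, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;  // threads past the image edge do nothing
    pixels[y * width + x] = 255;
}

void renderOnce(int width, int height) {
    unsigned char* dPixels = nullptr;
    CUDA_CHECK(cudaMalloc(&dPixels, (size_t)width * height));

    dim3 block(16, 16);  // 256 threads per block, one of the sizes from lesson 1
    dim3 grid(divUp(width, block.x), divUp(height, block.y));
    shadeKernel<<<grid, block>>>(dPixels, width, height);

    // A kernel launch returns no error code itself, so check explicitly.
    CUDA_CHECK(cudaGetLastError());       // launch errors, e.g. an invalid configuration
    CUDA_CHECK(cudaDeviceSynchronize());  // errors raised while the kernel was running

    CUDA_CHECK(cudaFree(dPixels));
}
```

With a check like the one after `cudaDeviceSynchronize()`, a kernel killed by the display watchdog should show up as an explicit error message rather than as a silently missing image.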
