I would really appreciate some help clarifying a few details about GPU performance, because I have been stuck on this for several weeks. Also, I'm sorry for my poor English, but I will do my best to explain the problem.
So, about my questions. Let's look at a very simple program: dense matrix multiplication using shared memory. As I understand it, NVIDIA provides one of its implementations in the CUDA C Programming Guide (here is the link): http://docs.nvidia.com/cuda/cuda-c-programming-guide/#shared-memory
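For reference, here is a minimal sketch of the kind of kernel I mean (the guide's version is organized a bit differently, wrapping the matrices in a Matrix struct, but the idea is the same; I assume BLOCK_SIZE = 16 and N divisible by BLOCK_SIZE):

    #define BLOCK_SIZE 16

    // Tiled matrix multiplication C = A * B for square N x N matrices.
    // Assumes N is a multiple of BLOCK_SIZE (true for N = 2048).
    __global__ void MatMulKernel(const float* A, const float* B, float* C, int N)
    {
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

        int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;
        int col = blockIdx.x * BLOCK_SIZE + threadIdx.x;
        float acc = 0.0f;

        // Walk over the tiles of A and B needed for this C element.
        for (int t = 0; t < N / BLOCK_SIZE; ++t) {
            // Each thread loads one element of each tile (coalesced loads).
            As[threadIdx.y][threadIdx.x] = A[row * N + (t * BLOCK_SIZE + threadIdx.x)];
            Bs[threadIdx.y][threadIdx.x] = B[(t * BLOCK_SIZE + threadIdx.y) * N + col];
            __syncthreads();

            // Multiply the two tiles together out of shared memory.
            for (int k = 0; k < BLOCK_SIZE; ++k)
                acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();
        }
        C[row * N + col] = acc;
    }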
It is very simple, and I think everyone who is familiar with CUDA has already seen it. But let's measure this kernel's performance (Gflops). Using the nvprof utility we can collect metrics that give the number of floating-point operations, and using CUDA events we can measure the kernel's execution time.
So, for a square matrix multiplication (2048x2048 float elements in each matrix), the operation count is 2*N^3 = 2*2048^3 = 1.7180e+10 flops, and with a measured execution time of 0.054 s that gives (1.7180e+10)/(0.054 * 10^9) = 318 Gflops.
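For completeness, this is roughly how I time the kernel and turn the result into Gflops (a sketch; timeMatMul is just a hypothetical wrapper, and d_A, d_B, d_C are device pointers already holding the inputs):

    #include <cstdio>
    #include <cuda_runtime.h>

    // Launch the kernel once, time it with CUDA events, and report Gflops.
    void timeMatMul(const float* d_A, const float* d_B, float* d_C, int N)
    {
        dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
        dim3 dimGrid(N / BLOCK_SIZE, N / BLOCK_SIZE);

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        MatMulKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C, N);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds

        double flops  = 2.0 * N * N * N;         // 2*N^3 = 1.7180e+10 for N = 2048
        double gflops = flops / (ms * 1.0e6);    // flops / (ms * 1e-3 s * 1e9) = Gflops

        printf("time = %.3f ms, %.1f Gflops\n", ms, gflops);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
    }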
Now it's important to say that I'm using a GeForce GTX Titan card, with a peak single-precision performance of about 3.1 Tflops. So we have reached only about 1/10 of peak performance, even though we have already applied all the optimizations that I know from my university CUDA course (shared memory, coalesced memory access and so on). My first guess would be that the problem is memory-bound, but as far as I know that is not the case here. For comparison, the cuBLAS SGEMM function (if I'm right) reaches about 71% of peak performance. Of course I understand that it's very hard to match cuBLAS, but why can't I reach even 1 Tflop?
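For reference, calling SGEMM from cuBLAS for this comparison looks roughly like this (a sketch; cublasMatMul is a hypothetical wrapper, and you need to link with -lcublas):

    #include <cublas_v2.h>

    // Reference point: the same multiplication through cuBLAS SGEMM.
    // cuBLAS uses column-major storage; for a pure throughput comparison on
    // square matrices the flop count is the same, so the layout does not
    // change the Gflops figure.
    void cublasMatMul(const float* d_A, const float* d_B, float* d_C, int N)
    {
        cublasHandle_t handle;
        cublasCreate(&handle);

        const float alpha = 1.0f, beta = 0.0f;
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    N, N, N,
                    &alpha, d_A, N,
                            d_B, N,
                    &beta,  d_C, N);

        cublasDestroy(handle);
    }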
So, the questions are:
1) Am I right in my reasoning?
2) What are the main reasons why I can't reach even half of peak performance?
3) What other optimizations can I use? (Here anything that you know will be very useful: articles, suggestions and so on.)
Thank you for your attention!