I am using a Tesla C2050, which has compute capability 2.0 and 48KB of shared memory. But when I try to use this shared memory, the nvcc
compiler gives me the following error:
Entry function '_Z4SAT3PhPdii' uses too much shared data (0x8020 bytes + 0x10 bytes system, 0x4000 max)
SAT1 is the naive implementation of a scan algorithm. Because I am operating on images of the order of 4096x2160, I have to use double to calculate the cumulative sum. As far as I understand, the Tesla C2050 does not support double, but it nevertheless does the task by demoting it to float. For an image width of 4096, the shared memory size comes out to be greater than 16KB, but it is well within the 48KB limit.
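To make the numbers in the error message concrete, here is a minimal sketch that decodes them. It assumes the shared buffer holds one element per pixel of a 4096-wide image row; the names `rowBufferBytes` and `fits` are made up for illustration and are not part of my kernel.

```cpp
#include <cassert>
#include <cstddef>

// Values copied verbatim from the nvcc error message.
constexpr std::size_t kUserSharedBytes   = 0x8020; // shared data requested by the kernel
constexpr std::size_t kSystemSharedBytes = 0x10;   // "system" bytes reported by nvcc
constexpr std::size_t kReportedMaxBytes  = 0x4000; // limit nvcc is enforcing

// Assumption: one element per pixel of an image row is kept in shared memory.
constexpr std::size_t rowBufferBytes(std::size_t width, std::size_t elemSize) {
    return width * elemSize;
}

// True when the user + system request fits under the given per-block limit.
constexpr bool fits(std::size_t limitBytes) {
    return kUserSharedBytes + kSystemSharedBytes <= limitBytes;
}
```

With these definitions, `rowBufferBytes(4096, sizeof(double))` is 32768 bytes (32KB), which accounts for all but 32 bytes of the 0x8020 (32800-byte) request. The reported maximum 0x4000 is 16384 bytes, so `fits(kReportedMaxBytes)` is false while `fits(48 * 1024)` is true: the compiler is checking against a 16KB limit rather than the 48KB I expected.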
Can anybody help me understand what is happening here? I am using CUDA Toolkit 3.0.