
I am using a Tesla C2050, which has compute capability 2.0 and 48KB of shared memory. But when I try to use this shared memory, the nvcc compiler gives me the following error:

Entry function '_Z4SAT3PhPdii' uses too much shared data (0x8020 bytes + 0x10 bytes system, 0x4000 max)

SAT1 is the naive implementation of a scan algorithm, and because I am operating on images with sizes on the order of 4096x2160 I have to use double to calculate the cumulative sum. Though the Tesla C2050 does not support double, it nevertheless does the task by demoting it to float. But for an image width of 4096 the shared memory size comes out to be greater than 16KB, which is still well within the 48KB limit.

Can anybody help me understand what is happening here? I am using CUDA Toolkit 3.0.

– Sachin

2 Answers

2

By default, Fermi cards run in a compatibility mode, with 16KB shared memory and 48KB L1 cache per multiprocessor. The API call `cudaThreadSetCacheConfig` can be used to change the GPU to run with 48KB shared memory and 16KB L1 cache, if you require it. You then must compile the code for compute capability 2.0 to avoid the code generation error you are seeing.
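
As a rough sketch of what that looks like in host code (the kernel, buffer sizes and file name below are stand-ins, not the question's SAT kernel, and the code must itself be built for compute capability 2.0):

    // Build for Fermi, e.g.: nvcc -arch=sm_20 -o demo demo.cu
    #include <cstdio>
    #include <cuda_runtime.h>

    // Stand-in kernel that statically allocates 32KB of shared memory,
    // i.e. more than the 16KB available on pre-Fermi targets.
    __global__ void bigSharedKernel(float *out)
    {
        __shared__ float buf[8192];            // 8192 * 4 bytes = 32KB
        int i = threadIdx.x;
        buf[i] = (float)i;
        __syncthreads();
        out[i] = buf[i];
    }

    int main()
    {
        // Ask the runtime to favour the 48KB shared memory / 16KB L1 split
        // on Fermi, as described above. (Later CUDA versions expose the
        // same setting as cudaDeviceSetCacheConfig.)
        cudaThreadSetCacheConfig(cudaFuncCachePreferShared);

        float *d_out;
        cudaMalloc(&d_out, 256 * sizeof(float));
        bigSharedKernel<<<1, 256>>>(d_out);
        printf("launch: %s\n", cudaGetErrorString(cudaGetLastError()));
        cudaFree(d_out);
        return 0;
    }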

Also, your Tesla C2050 does support double precision. If you are getting compiler warnings about demoting doubles, it means you are not compiling your code for the correct architecture. Add

--arch=sm_20

to your nvcc arguments and the GPU toolchain will compile for your Fermi card, and will include double precision support and other Fermi specific hardware features, including larger shared memory size.
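
For illustration, a minimal double-precision sketch (the kernel name, the one-thread-per-row layout and the buffer handling are made up here, not the question's actual SAT code). Built for the toolkit's default sm_10 target, nvcc warns about demoting the doubles to float; built with `-arch=sm_20`, the same kernel keeps real double precision on the C2050:

    #include <cstdio>
    #include <cuda_runtime.h>

    // Illustrative kernel only: one thread block per row accumulates a
    // running sum in double precision, similar in spirit to the cumulative
    // sum described in the question.
    __global__ void rowCumSum(const unsigned char *in, double *out, int width)
    {
        int row = blockIdx.x;
        double sum = 0.0;                  // a true double only on sm_13 and above
        for (int x = 0; x < width; ++x) {
            sum += in[row * width + x];
            out[row * width + x] = sum;
        }
    }

    int main()
    {
        const int width = 4096, height = 2160;
        unsigned char *d_in;
        double *d_out;
        cudaMalloc(&d_in, width * height);
        cudaMemset(d_in, 1, width * height);
        cudaMalloc(&d_out, (size_t)width * height * sizeof(double));

        rowCumSum<<<height, 1>>>(d_in, d_out, width);
        printf("launch: %s\n", cudaGetErrorString(cudaGetLastError()));

        cudaFree(d_in);
        cudaFree(d_out);
        return 0;
    }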

– talonmies
  • Thanks a lot for the answer, but I have called `cudaThreadSetCacheConfig` with the option `cudaFuncCachePreferShared` so that the shared memory would be set to 48KB, but it still showed the error. Could there be some other reason? I am using Visual Studio 2008, and in the options it only shows sm_10 to sm_13, but not further. As you said it is working in compatibility mode, is there a way I can make a system-wide change to run on the newer architecture? Thanks – Sachin Jan 29 '12 at 18:31
  • You are going to have to compile for the Fermi architecture, otherwise you are not going to get the code built. I don't use visual studio, so I can't help with that, I am afraid. – talonmies Jan 29 '12 at 18:58
  • You are using old build-rules for the project. Update to CUDA 3.2 or 4.x. The option you have to modify is in Project Properties -> CUDA Runtime API -> GPU -> GPU Architecture(x) to sm_20 – brano Jan 30 '12 at 12:49
  • @brano: We have used the CUDA VS wizard to integrate the CUDA rules; could that be a possible reason for it working in compatibility mode? We tried the normal way of including Cuda.rules but that didn't work. Any tips or links that I can go through? – Sachin Jan 30 '12 at 19:17
  • My suggestion is to update/install the latest CUDA 4.1: http://developer.nvidia.com/cuda-toolkit-41. You will need to install the driver and the CUDA toolkit. After that, right-click on your project in VS, press "custom build rules" and select the runtime API build rule for CUDA 4.1. – brano Jan 31 '12 at 09:04
  • Another thing you could try without having to install a new CUDA version is to add the command-line option as talonmies suggested. This can be done under project properties -> CUDA Runtime API -> GPU -> Extra Options. Add "-arch=sm_20" without the quotes. – brano Jan 31 '12 at 09:23
0

As far as I know, CUDA 3.0 supports compute 2.0. I use VS 2010 with CUDA 4.1, so I am assuming VS 2008 should also be somewhat similar. Right-click on the project and select Properties -> CUDA C/C++ -> Device -> Code Generation. Change it to compute_10,sm_10;compute_20,sm_20
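
For reference, outside Visual Studio that Code Generation setting corresponds to nvcc options along the lines of

-gencode arch=compute_10,code=sm_10 -gencode arch=compute_20,code=sm_20

though, as noted in the answer above, only the compute_20,sm_20 target will accept more than 16KB of static shared memory.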