I want to use CUB to sort an array within each block. I launch the kernel with multiple blocks; each block has 32 threads, and each thread holds an array of 27 integers. According to CUB's GitHub page, the standard block-wide sort looks like this:
__global__ void foo(...){
    int cells[27];
    typedef cub::BlockRadixSort<int, 32, 27> BlockRadixSort;
    __shared__ typename BlockRadixSort::TempStorage temp_storage;
    BlockRadixSort(temp_storage).Sort(cells);
    ...
}
Later in the kernel, however, I need the cells in shared memory, like this:
__global__ void foo(...){
    __shared__ int cells[32 * 27];
    ...
}
Is it possible in CUB to sort arrays that already reside in shared memory, or do I have to sort them in per-thread arrays first and copy the results into shared memory afterwards?
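For reference, here is a minimal sketch of the fallback I have in mind: each thread copies its 27 cells from shared memory into a local array, the block runs `BlockRadixSort`, and the sorted keys are written back to shared memory. The names `sortCellsViaRegisters` and `runSharedSortDemo`, the `in`/`out` buffers, and the test data are hypothetical and only there to make the sketch checkable; the extra `__syncthreads()` after the sort is a conservative assumption on my part, not something I have confirmed to be required.

```cuda
#include <cub/cub.cuh>
#include <algorithm>
#include <vector>

constexpr int THREADS = 32;              // threads per block, as in the question
constexpr int ITEMS   = 27;              // cells per thread, as in the question
constexpr int TILE    = THREADS * ITEMS; // cells per block

// Hypothetical kernel: sorts the block's cells and leaves them in shared memory.
__global__ void sortCellsViaRegisters(const int* in, int* out)
{
    using BlockRadixSort = cub::BlockRadixSort<int, THREADS, ITEMS>;
    __shared__ typename BlockRadixSort::TempStorage temp_storage;
    __shared__ int cells[TILE];

    // Stand-in for whatever fills `cells` in the real kernel.
    for (int i = threadIdx.x; i < TILE; i += THREADS)
        cells[i] = in[blockIdx.x * TILE + i];
    __syncthreads();

    // Blocked arrangement: thread t owns cells[t * ITEMS] .. cells[t * ITEMS + ITEMS - 1].
    int keys[ITEMS];
    for (int i = 0; i < ITEMS; ++i)
        keys[i] = cells[threadIdx.x * ITEMS + i];

    BlockRadixSort(temp_storage).Sort(keys);
    __syncthreads(); // conservative: all reads of `cells` are done before overwriting

    // Write the sorted keys back; thread 0 now holds the 27 smallest values.
    for (int i = 0; i < ITEMS; ++i)
        cells[threadIdx.x * ITEMS + i] = keys[i];
    __syncthreads(); // `cells` is now sorted per block and usable by every thread

    // Copy out only so the host can check the result.
    for (int i = threadIdx.x; i < TILE; i += THREADS)
        out[blockIdx.x * TILE + i] = cells[i];
}

// Hypothetical host check: returns true if every block's tile matches a host-side sort.
bool runSharedSortDemo(int num_blocks = 2)
{
    const int n = num_blocks * TILE;
    std::vector<int> h_in(n), h_out(n);
    unsigned int state = 12345u;
    for (int& v : h_in) {
        state = state * 1664525u + 1013904223u;        // simple LCG test data
        v = static_cast<int>(state >> 8) - (1 << 23);  // mix of negative and positive
    }

    int *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    sortCellsViaRegisters<<<num_blocks, THREADS>>>(d_in, d_out);
    bool ok = cudaMemcpy(h_out.data(), d_out, n * sizeof(int),
                         cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(d_in);
    cudaFree(d_out);

    for (int b = 0; b < num_blocks && ok; ++b) {
        auto ref = h_in.begin() + b * TILE;
        std::sort(ref, ref + TILE); // host reference for this block's tile
        ok = std::equal(ref, ref + TILE, h_out.begin() + b * TILE);
    }
    return ok;
}
```

Calling `runSharedSortDemo()` from `main` should return `true` if each block's 864 cells end up in ascending order. The cost of this approach is the extra round trip between shared memory and the per-thread arrays, which is what I would like to avoid if CUB can work on shared memory directly.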
Alternatively, is there a way to keep all the cells in global memory and have a CUB device-wide function sort them, but independently for each block-sized segment?
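The closest candidate I have found for the global-memory variant is `cub::DeviceSegmentedRadixSort`. Below is a minimal sketch of how I would expect to call it, assuming equally sized segments of 32 * 27 keys described by an offsets array; the helper `runSegmentedSortDemo`, the buffer names, and the test data are hypothetical, and I have not measured whether this is faster than sorting inside the kernel.

```cuda
#include <cub/cub.cuh>
#include <algorithm>
#include <vector>

// Hypothetical host helper: sorts each segment of `segment_size` keys independently
// and returns true if every segment matches a host-side sort of the same segment.
bool runSegmentedSortDemo(int num_segments = 4, int segment_size = 32 * 27)
{
    const int num_items = num_segments * segment_size;
    std::vector<int> h_keys(num_items), h_sorted(num_items);
    std::vector<int> h_offsets(num_segments + 1);

    unsigned int state = 2024u;
    for (int& v : h_keys) {
        state = state * 1664525u + 1013904223u; // simple LCG test data
        v = static_cast<int>(state >> 8);
    }
    // Segment s covers [h_offsets[s], h_offsets[s + 1]).
    for (int s = 0; s <= num_segments; ++s)
        h_offsets[s] = s * segment_size;

    int *d_in = nullptr, *d_out = nullptr, *d_offsets = nullptr;
    cudaMalloc(&d_in, num_items * sizeof(int));
    cudaMalloc(&d_out, num_items * sizeof(int));
    cudaMalloc(&d_offsets, (num_segments + 1) * sizeof(int));
    cudaMemcpy(d_in, h_keys.data(), num_items * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_offsets, h_offsets.data(), (num_segments + 1) * sizeof(int),
               cudaMemcpyHostToDevice);

    // First call with a null workspace only reports the required temp storage size.
    void*  d_temp     = nullptr;
    size_t temp_bytes = 0;
    cub::DeviceSegmentedRadixSort::SortKeys(d_temp, temp_bytes, d_in, d_out,
                                            num_items, num_segments,
                                            d_offsets, d_offsets + 1);
    cudaMalloc(&d_temp, temp_bytes);

    // Second call performs the per-segment sort.
    cub::DeviceSegmentedRadixSort::SortKeys(d_temp, temp_bytes, d_in, d_out,
                                            num_items, num_segments,
                                            d_offsets, d_offsets + 1);

    bool ok = cudaMemcpy(h_sorted.data(), d_out, num_items * sizeof(int),
                         cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(d_temp);
    cudaFree(d_offsets);
    cudaFree(d_out);
    cudaFree(d_in);

    for (int s = 0; s < num_segments && ok; ++s) {
        auto ref = h_keys.begin() + h_offsets[s];
        std::sort(ref, ref + segment_size); // host reference for this segment
        ok = std::equal(ref, ref + segment_size, h_sorted.begin() + h_offsets[s]);
    }
    return ok;
}
```

With this layout a later kernel could load segment `blockIdx.x` from the output buffer into its `__shared__ int cells[32 * 27]` array, at the price of a separate sorting pass over global memory.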