
I'm struggling to understand what the value given by the `multiProcessorCount` property represents, because I have trouble grasping the CUDA architecture.

I'm sorry if some of the following statements appear to be naive. From what I understood so far, here are the hardware "layers":

  • A CUDA processor is a grid of building blocks.
  • A building block is composed of two or more streaming multiprocessors.
  • A streaming multiprocessor is composed of many streaming processors, also called cores.
  • A streaming processor is "massively" threaded, meaning that it manages many hardware threads. One streaming processor (one core) can really compute only one thread at a time, but it has many "hardware threads" that can load data while waiting for their turn to be computed by the SP.

On the software side:

  • A block is composed of threads, and is executed by a streaming multiprocessor.
  • If one launches more blocks than there are streaming multiprocessors on the card, I guess the blocks wait in some sort of queue to be executed.
  • Software threads are distributed to streaming processors, which distribute them to their hardware threads. Similarly, if one launches more threads than the streaming processors can handle with their hardware threads, the software threads wait in a queue.

In both cases, the maximum number of threads and blocks that one is allowed to launch is independent of the number of streaming multiprocessors, streaming processors, and hardware threads per streaming processor that actually exist on the card. Those notions are software!

Am I at least close to the reality?

With that being said, what does the `multiProcessorCount` property give? On my 610M, it says I only have one multiprocessor... Does that mean that I only have one streaming multiprocessor? I would have a building block composed of only one streaming multiprocessor? That seems impossible to me. And that would mean that I can only execute one block at a time! Besides, when the specifications of my card say that I have 48 CUDA cores, are they talking about streaming processors?
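For reference, this is a minimal sketch of how the property is read. `cudaGetDeviceProperties`, `cudaDeviceProp`, and the `multiProcessorCount` field are part of the real CUDA runtime API, but building and running this requires the CUDA toolkit (compile with `nvcc`) and a CUDA-capable device:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // query properties of device 0
    printf("SMs (multiProcessorCount): %d\n", prop.multiProcessorCount);
    printf("max threads per block:     %d\n", prop.maxThreadsPerBlock);
    printf("warp size:                 %d\n", prop.warpSize);
    return 0;
}
```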

Yugo Amaryl
  • Yes, `multiProcessorCount` gives the number of streaming multiprocessors (i.e. SMs) that the GPU has. Mobile GPUs typically have at most a few of these. It's not unreasonable for your GPU to have only one. The internal structure (i.e., what precisely is a "CUDA core" and how many there are) within an SM gets murky. – Jared Hoberock May 19 '13 at 22:04
  • I also have access to a machine with a GT 440. It's not a mobile GPU, despite being cheap. The `multiProcessorCount` property says: 2. I'm confused because books and articles always show many building blocks, and say that a building block contains at least two streaming multiprocessors. Besides, do you agree with my statements? I find those notions poorly explained wherever I look... – Yugo Amaryl May 19 '13 at 22:14
  • Books and articles tend to emphasize the current flagship chip, which typically has a dozen SMs or more. Even though a building block (sometimes called a TPC) is composed of one or many SMs, it's possible to sell a GPU with some of these SMs disabled. This is called "floor sweeping". – Jared Hoberock May 20 '13 at 04:11

1 Answer


Perhaps this answer will help. It's a little out of date now since it refers to old architectures, but the principles are the same.

It is entirely possible for a GPU to consist of a single SM (streaming multiprocessor), especially if it is a mobile GPU. That single SM, which is composed of multiple CUDA cores, can accommodate multiple thread blocks (up to 16 on the latest Kepler-generation GPUs).

In your case, your 610M GPU has one Streaming Multiprocessor (SM), composed of 48 CUDA cores (aka Streaming Processors, SPs).

Tom
  • Ok. So, as long as I don't need more threads than a block can contain, I have no interest in launching more than one block on my 610M, right? (And more than two on the GT 440.) I wonder why I've never seen any code using the `multiProcessorCount` property to choose the number of blocks and threads... Instead, the Stack Overflow answer you linked to states: "you should ensure that when you launch a GPU function your grid is composed of a large number of blocks (at least hundreds) to ensure it scales across any GPU". – Yugo Amaryl May 20 '13 at 23:10
  • And I don't understand what you mean when you say "The final point to make is that an SM can execute more than one block at any given time." You mean that a block can be paused while another is currently being executed? – Yugo Amaryl May 20 '13 at 23:17
  • We generally want to write code that has many more blocks than SMs (like 2-4x more, at least) and therefore lots and lots of threads. This is how the machine hides various latencies. That's why nobody chooses the total number of blocks or threads based on machine architectural details like the number of SMs. And yes, multiple blocks can be resident and executing together on a single SM. You might want to take some basic CUDA webinars. It could be a couple of hours well spent, if you want to understand these concepts further. – Robert Crovella May 21 '13 at 03:58