I have a kernel that performs poorly on CC 3.0 (Kepler) compared with CC 2.0 (Fermi). In the Nsight profiler, the Warp Issue Efficiency chart shows that there were no eligible warps 60% of the time, and the Issue Stall Reasons chart shows that 60% of those stalls are due to "Other".

I'm wondering what the Other issue stall reasons are and what I might do to reduce them.

CUDA 5.0 / Nsight 3.0 RC / CC 3.0.

Issue Stall Reasons

Roger Dahl
  • Thank you for the info. The kernel is a PBKDF2-HMAC-SHA1 key extender. So, it runs tons of integer adds and bitwise operations. In particular, there are lots of 32-bit integer shifts, which have low throughput on Kepler. There are no FP operations. I guess that, in this case, Other is due to the low throughput integer operations, so there's not much to be done. – Roger Dahl Feb 15 '13 at 23:52
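
To illustrate why a SHA-1 based kernel is so shift-heavy: SHA-1 rotates its working words left by 1, 5 and 30 bits, and when no single-instruction rotate is available each rotation costs two shifts plus an OR. The sketch below is plain host-side C++, not the poster's kernel; the name `rotl32` is hypothetical, and whether the compiler actually emits two shifts for CC 3.0 is an assumption.

```cpp
#include <cassert>
#include <cstdint>

// Rotate a 32-bit word left by n bits (0 < n < 32).
// Written the portable way: two shifts and one OR per rotation.
// Hypothetical helper for illustration only.
std::uint32_t rotl32(std::uint32_t x, unsigned n) {
    assert(n > 0 && n < 32);            // n == 0 or 32 would shift by 32 bits
    return (x << n) | (x >> (32u - n)); // high bits wrap around to the bottom
}
```

Counting such rotations per round gives a rough idea of how much of the instruction mix lands on the low-throughput shift path.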

1 Answer

In the Nsight Visual Studio Edition 3.0 CUDA Profiler, the Issue Efficiency view displays a pie chart of the warp stall reasons. The stall reasons are Instruction Fetch, Execution Dependency, Data Requests, Texture, Synchronization, and Other.

For Compute Capability 3.* devices the Other category is the percentage of time that active warps are stalled due to the following reasons:

  • execution unit is busy (reduce use of low throughput integer operations)
  • register bank conflicts (compiler issue that can sometimes be made worse by heavy use of vector data types)
  • too few warps per scheduler
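
As a rough way to check the "too few warps per scheduler" case, the resident warp count can be estimated from the block size. This is a minimal host-side sketch, assuming the CC 3.0 per-multiprocessor limits (64 resident warps, 16 resident blocks, 4 warp schedulers), a grid large enough to fill the SM, and no register or shared-memory limits. The function names are hypothetical; a real kernel should be checked against the profiler's occupancy figures.

```cpp
#include <algorithm>
#include <cassert>

// Per-multiprocessor limits for CC 3.0 (Kepler).
// Assumption: register and shared-memory limits are ignored here.
constexpr int kWarpSize = 32;
constexpr int kMaxWarpsPerSM = 64;
constexpr int kMaxBlocksPerSM = 16;
constexpr int kSchedulersPerSM = 4;
constexpr int kMaxThreadsPerBlock = 1024;

// Warps resident on one SM for a given block size (hypothetical helper).
int resident_warps_per_sm(int threads_per_block) {
    assert(threads_per_block > 0 && threads_per_block <= kMaxThreadsPerBlock);
    // A partially filled warp still occupies a whole warp slot.
    int warps_per_block = (threads_per_block + kWarpSize - 1) / kWarpSize;
    // Blocks are resident as a whole, so the warp limit is applied per block.
    int blocks = std::min(kMaxBlocksPerSM, kMaxWarpsPerSM / warps_per_block);
    return blocks * warps_per_block;
}

// Average number of warps each scheduler can choose from.
int resident_warps_per_scheduler(int threads_per_block) {
    return resident_warps_per_sm(threads_per_block) / kSchedulersPerSM;
}
```

Under these assumptions a 768-thread block gives 12 warps per scheduler, a 256-thread block gives 16, and a 32-thread block gives only 4, because the 16-block limit is reached first. Block size and occupancy therefore do not move together in any simple way.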

For Compute Capability 5.* and 6.* devices the Other category is the percentage of time that active warps are stalled due to the following reasons:

  • register bank conflicts (compiler issue that can sometimes be made worse by heavy use of vector data types)
  • warps waiting to resolve branches
  • warps that are lower priority and are not currently being considered for scheduling

For 5.* and 6.* devices, especially GP100, the last reason can be very high (~75%) if the kernel reaches 32 warps per warp scheduler.

These stall reasons are grouped into the Other category because it is hard to identify actions that a developer can take to resolve them.

Greg Smith
  • ... the third reason seems pretty actionable to me: "Increase the block dimensions". Am I wrong? – einpoklum Apr 13 '17 at 15:48
  • The workflow is first to look at the issue efficiency. If there were no eligible warps on 60% of the cycles, then increasing occupancy is often one of the first things to investigate. The scheduler stall reasons can give more insight into execution issues. Increasing occupancy does not necessarily mean increasing block dimensions. Often, fewer warps per block is a better way to increase warps per SM. – Greg Smith Apr 13 '17 at 21:20
  • (1) Can you give a concrete example of your last sentence, in which the "too few warps per scheduler" is the dominant factor (~50% or more) of the stall reasons? If so, I'll make that another question and you can answer it. (2) Is this list of three reasons exhaustive? – einpoklum Apr 13 '17 at 21:34
  • The list is not exhaustive. I have updated the answer slightly for Maxwell (5.*) and Pascal (6.*), as the Maxwell/Pascal architecture is significantly different from the previous Fermi and Kepler architectures and the stall reasons are quite different. Stall reasons should only be looked at if the percentage of cycles in which no instruction is issued is high (>20%). – Greg Smith Apr 29 '17 at 21:27