
I used the roofline model to analyze my code optimization.

But I found that the point shown in green lies outside the bandwidth boundary, yet the program runs without any problem.

I don't understand why the green point is not inside the red region.

The machine is a Linux virtual machine with 2 cores; my local physical machine has 4 cores. The analysis tool is Intel Advisor.

[Intel Advisor roofline screenshots]

Lance

1 Answer


This is possible if the application is not bound by the DRAM but by the CPU caches, which have a significantly higher throughput. For example, the application can be bound by the LLC (Last Level Cache, typically the level-3 cache) of your processor.
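
For instance, a kernel along the lines of the following sketch (the array size, repeat count and constant are made-up illustrative values, not taken from your code) keeps re-reading a buffer small enough to stay in the LLC, so its measured bandwidth and GFlops can legitimately sit above the DRAM roof while staying under the cache roofs:

```c
#include <stdlib.h>

/* Minimal sketch (illustrative sizes, not from the question): the 2 MB
   buffer fits in a typical LLC, so after the first pass almost all reads
   are served from cache rather than DRAM. */
#define N (1 << 18)     /* 256 K doubles = 2 MB */
#define REPEAT 1000

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double sum = 0.0;

    for (int i = 0; i < N; i++)
        a[i] = (double)i;

    /* Re-read the same small buffer many times: only the first pass
       touches DRAM, the remaining passes hit the caches. */
    for (int r = 0; r < REPEAT; r++)
        for (int i = 0; i < N; i++)
            sum += a[i] * 1.000001;

    free(a);
    return sum > 0.0 ? 0 : 1;   /* use sum so the loops are not optimized away */
}
```

If Advisor also draws the cache roofs (L1/L2/L3) in addition to the DRAM one, checking which roof the green dot actually sits under tells you which memory level the kernel is bound by.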

Note that a 4.26 GFlops peak performance is pretty low for a relatively recent processor (AFAIK my i5-9600KF reaches ~200 GFlops in double precision and ~400 GFlops in single precision). 36.31 GFlops is more realistic, but still quite low unless your target processor is pretty old or an embedded/mobile one. Also note that virtual machines running on shared hardware are generally not great for profiling, since other users may simultaneously run applications that use the DRAM, resulting in misleading profiling results.
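
For reference, a common back-of-the-envelope estimate of the peak is the product of the core count, the frequency and the per-core SIMD/FMA throughput (the symbols below are placeholders; the number of FMA units and the SIMD width depend on the exact microarchitecture and have to be looked up for your CPU):

\[
P_{\text{peak}} \;\approx\; n_{\text{cores}} \times f \times n_{\text{FMA}} \times 2 \times \frac{\text{SIMD width (bits)}}{\text{element size (bits)}}
\]

The comments below apply this kind of estimate to the machine in question.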

Jérôme Richard
  • Hello Richard, thanks very much for your suggestion. I have used another machine to analyze the performance, and I found some differences. One run is the original serial code and the other is the OpenMP code, but the problem is still the 4.15 GFlops value. Should I use SIMD to optimize the code? – Lance Oct 27 '22 at 17:51
  • In the OpenMP code, the top two plots are related to the serial program, but these two plots reach the peak performance; I am a little puzzled by the peak performance value of 4.15 GFLOPS. – Lance Oct 27 '22 at 18:11
  • P_peak = #cores × #FMA × #SIMD × f = 4 × 2 × (512/64) × 1.6 GHz = 102.4 GFLOPS – Lance Oct 27 '22 at 18:32
  • Yeah, using SIMD instructions should be a good idea for the parallel code since it is nearly saturating the computing units based on the provided graphs (except maybe one kernel on the left, and possibly the three in the middle), especially for the point on the right (which is suspicious). – Jérôme Richard Oct 29 '22 at 10:02
  • Actually, 4.15 for a scalar sequential code is not too bad (I missed that). My processor reaches 9 GFlops in this case (18 with FMA). It looks like your architecture is designed to execute FMA-based code efficiently, but not all codes can benefit from it. That being said, you should get a better peak Flops with more cores (a bit less than twice for 2x cores). If this is not the case, then the target processor is very weird: sharing a SIMD unit between 2 cores, downscaling the frequency a lot, etc. I guess the results reported by Intel Advisor are per core. – Jérôme Richard Oct 29 '22 at 10:06
  • Note that the frequency generally changes depending on the number of cores used and the kind of SIMD instructions executed, so as to keep the power budget below a given fixed value (and so the processor does not melt quickly ;) ). This is why embarrassingly parallel codes do not scale perfectly. What is your exact target processor? Some processors have two SIMD units while others have only one (in fact, some ARM processors can even have more than 2 units). Some processors are big/little ones and are not well supported by many parallel software packages (e.g. M1, Alder Lake). – Jérôme Richard Oct 29 '22 at 10:13
  • By the way, is the parallel code significantly faster? If so, it vaguely confirms that the value reported by Intel Advisor is per core. – Jérôme Richard Oct 29 '22 at 10:14