
One of my co-workers has been reading Clean Code by Robert C. Martin and got to the section about using many small functions as opposed to fewer large functions. This led to a debate about the performance consequences of this methodology. So we wrote a quick program to test the performance and are confused by the results.

For starters, here is the normal version of the function.

static double NormalFunction()
{
    double a = 0;
    for (int j = 0; j < s_OuterLoopCount; ++j)
    {
        for (int i = 0; i < s_InnerLoopCount; ++i)
        {
            double b = i * 2;
            a = a + b + 1;
        }
    }
    return a;
}

Here is the version I made that breaks the functionality into small functions.

static double TinyFunctions()
{
    double a = 0;
    for (int i = 0; i < s_OuterLoopCount; i++)
    {
        a = Loop(a);
    }
    return a;
}
static double Loop(double a)
{
    for (int i = 0; i < s_InnerLoopCount; i++)
    {
        double b = Double(i);
        a = Add(a, Add(b, 1));
    }
    return a;
}
static double Double(double a)
{
    return a * 2;
}
static double Add(double a, double b)
{
    return a + b;
}

I use the Stopwatch class to time the functions, and when I ran it in debug I got the following results.

s_OuterLoopCount = 10000;
s_InnerLoopCount = 10000;
NormalFunction Time = 377 ms;
TinyFunctions Time = 1322 ms;

These results make sense to me, especially in debug, as there is additional overhead from the function calls. It is when I run it in release that I get the following results.

s_OuterLoopCount = 10000;
s_InnerLoopCount = 10000;
NormalFunction Time = 173 ms;
TinyFunctions Time = 98 ms;

These results confuse me: even if the compiler were optimizing TinyFunctions by inlining all the function calls, how could that make it ~57% faster?

We have tried moving variable declarations around in NormalFunction, and it had basically no effect on the run time.

I was hoping that someone would know what is going on and, if the compiler can optimize TinyFunctions so well, why it can't apply similar optimizations to NormalFunction.

In looking around, we found a post where someone mentioned that having the functions broken out allows the JIT to better optimize what to put in the registers, but NormalFunction only has 4 variables, so I find it hard to believe that explains the massive performance difference.

I'd be grateful for any insight someone can provide.

Update 1: As pointed out below by Kyle, changing the order of operations made a massive difference in the performance of NormalFunction.

static double NormalFunction()
{
    double a = 0;
    for (int j = 0; j < s_OuterLoopCount; ++j)
    {
        for (int i = 0; i < s_InnerLoopCount; ++i)
        {
            double b = i * 2;
            a = b + 1 + a;
        }
    }
    return a;
}

Here are the results with this configuration.

s_OuterLoopCount = 10000;
s_InnerLoopCount = 10000;
NormalFunction Time = 91 ms;
TinyFunctions Time = 102 ms;

This is more like what I expected, but it still leaves the question of why the order of operations can cause a ~56% performance hit.

Furthermore, I then tried it with integer operations, and we are back to results that don't make any sense.

s_OuterLoopCount = 10000;
s_InnerLoopCount = 10000;
NormalFunction Time = 87 ms;
TinyFunctions Time = 52 ms;

And this doesn't change regardless of the order of operations.

  • Probably because in release mode small functions can be optimized by inlining. Also, [tail call optimization](https://en.wikipedia.org/wiki/Tail_call) might be part of the difference. Optimizing the "NormalFunction" is more difficult because it's much harder for the compiler to recognize possible optimizations. – GreatAndPowerfulOz Jul 18 '17 at 18:01
  • Those would explain why TinyFunctions could perform as fast as NormalFunction, but it doesn't explain to me how it performs so much faster, because NormalFunction is pretty much a pre-inlined version of TinyFunctions. – Pumices Jul 18 '17 at 18:04
  • I'm guessing it might be the double addition getting treated differently try Normal with a = a + (b + 1); – user6144226 Jul 18 '17 at 18:06
  • One way to learn what's going on is to look at the IL code generated by using ILSPY – GreatAndPowerfulOz Jul 18 '17 at 18:07
  • We tried with both double and int addition and tried other operations as well and it did not change the effects. We also looked at the IL code as well and while I am no expert at IL code it looked to be about what I expected. As I would have expected TinyFunctions has more instructions than NormalFunctions and there is nothing in the IL code that makes me understand what is going on as I think the performance gain is in the JIT. – Pumices Jul 18 '17 at 18:10
  • 1
    paging eric lippert or jon skeet – pm100 Jul 18 '17 at 18:26

1 Answer


I can make performance match much better by changing one line of code:

a = a + b + 1;

Change it to:

a = b + 1 + a;

Or:

a += b + 1;

Now you'll find that NormalFunction might actually be slightly faster and you can "fix" that by changing the signature of the Double method to:

int Double( int a ) { return a * 2; }

I thought of these changes because this is what was different between the two implementations. After this, their performance is very similar with TinyFunctions being a few percent slower (as expected).

The second change is easy to explain: the NormalFunction implementation actually doubles an int and then converts it to a double (with an fild opcode at the machine code level). The original Double method loads a double first and then doubles it, which I would expect to be slightly slower.

But that doesn't account for the bulk of the runtime discrepancy. That comes down almost entirely to the order change I made first. Why? I don't really have any idea. The difference in machine code looks like this:

Original                                                    Changed
01070620  push        ebp                                   01390620  push        ebp  
01070621  mov         ebp,esp                               01390621  mov         ebp,esp  
01070623  push        edi                                   01390623  push        edi  
01070624  push        esi                                   01390624  push        esi  
01070625  push        eax                                   01390625  push        eax  
01070626  fldz                                              01390626  fldz  
01070628  xor         esi,esi                               01390628  xor         esi,esi  
0107062A  mov         edi,dword ptr ds:[0FE43ACh]           0139062A  mov         edi,dword ptr ds:[12243ACh]  
01070630  test        edi,edi                               01390630  test        edi,edi  
01070632  jle         0107065A                              01390632  jle         0139065A  
01070634  xor         edx,edx                               01390634  xor         edx,edx  
01070636  mov         ecx,dword ptr ds:[0FE43B0h]           01390636  mov         ecx,dword ptr ds:[12243B0h]  
0107063C  test        ecx,ecx                               0139063C  test        ecx,ecx  
0107063E  jle         01070655                              0139063E  jle         01390655  
01070640  mov         eax,edx                               01390640  mov         eax,edx  
01070642  add         eax,eax                               01390642  add         eax,eax  
01070644  mov         dword ptr [ebp-0Ch],eax               01390644  mov         dword ptr [ebp-0Ch],eax  
01070647  fild        dword ptr [ebp-0Ch]                   01390647  fild        dword ptr [ebp-0Ch]  
0107064A  faddp       st(1),st                              0139064A  fld1  
0107064C  fld1                                              0139064C  faddp       st(1),st  
0107064E  faddp       st(1),st                              0139064E  faddp       st(1),st  
01070650  inc         edx                                   01390650  inc         edx  
01070651  cmp         edx,ecx                               01390651  cmp         edx,ecx  
01070653  jl          01070640                              01390653  jl          01390640  
01070655  inc         esi                                   01390655  inc         esi  
01070656  cmp         esi,edi                               01390656  cmp         esi,edi  
01070658  jl          01070634                              01390658  jl          01390634  
0107065A  pop         ecx                                   0139065A  pop         ecx  
0107065B  pop         esi                                   0139065B  pop         esi  
0107065C  pop         edi                                   0139065C  pop         edi  
0107065D  pop         ebp                                   0139065D  pop         ebp  
0107065E  ret                                               0139065E  ret  

Which is opcode-for-opcode identical except for the order of the floating-point operations. That makes a huge performance difference, but I don't know enough about x86 floating-point operations to know exactly why.

Update:

With the new integer version we see something else curious. In this case it seems the JIT is trying to be clever and apply an optimization because it turns this:

int b = 2 * i;
a = a + b + 1;

Into something like:

mov esi, eax              ; b = i
add esi, esi              ; b += b
lea ecx, [ecx + esi + 1]  ; a = a + b + 1

Where a is stored in the ecx register, i in eax, and b in esi.

Whereas the TinyFunctions version gets turned into something like:

mov         eax, edx  
add         eax, eax  
inc         eax  
add         ecx, eax  

Where i is in edx, b is in eax, and a is in ecx this time around.

I suppose for our CPU architecture this LEA "trick" (explained here) ends up being slower than just using the ALU proper. It is still possible to change the code to get the performance between the two to line up:

int b = 2 * i + 1;
a += b;

This forces the NormalFunction approach to be compiled into mov, add, inc, add, as it appears in the TinyFunctions approach.

  • I have confirmed that this is the case, thanks for pointing it out. This is perhaps just as confusing, how can be there that big of a performance difference resulting from such a small change and why doesn't the compiler optimize this? – Pumices Jul 18 '17 at 19:31
  • @Pumices I would guess it doesn't optimize it for a couple reasons: First, I would be surprised if this was a "known" optimization. I don't know a lot about the FPU, but I can't imagine that switching the order of 2 opcodes like this should result in such a huge performance difference. Second, I would argue that it shouldn't change the order of evaluation since in other cases that could lead to a change in behavior (such as catastrophic cancellation or other unusual rounding issues). – Kyle Jul 18 '17 at 19:40
  • Ok, just to add another wrinkle to the issue, I tried moving everything over to integers and found that that operation switch has no effect on the run time of the functions and even worse, TinyFunctions is now ~56% faster again regardless of the order of operations. – Pumices Jul 18 '17 at 19:45
  • Interesting, once again you are correct. My faith in compiler optimizations is slowly waning. This does bring the two back in line but I am still seeing a 5-10% speed improvement with the TinyFunctions. This is far better than ~56% but I'm still confused by this. – Pumices Jul 18 '17 at 20:39
  • And yet `a += (Double(j) + 1);` is twice as slow as `int b = Double(j)+1; a += b;`. The C# compiler really seems to have room for improvement. – NetMage Jul 18 '17 at 20:42
  • @Pumices When profiling also make sure that "Prefer 32-bit" is disabled in the project settings. When disabling that the performance difference favors `NormalFunction` in the integer case. – Kyle Jul 18 '17 at 20:44
  • @NetMage Technically, it's not the C# compiler. It's the JIT implementation that's responsible for these differences. And regarding improving it, it's not necessarily obvious to me how it could be actually improved. From the findings I presented these sorts of optimizations are probably highly dependent on the exact processor architecture being targeted (especially that LEA optimization). These are also micro-optimizations which will almost certainly be completely shadowed by other code in a typical production application. – Kyle Jul 18 '17 at 20:51
  • @Kyle Disabling "Prefer 32-Bit" did indeed get the NormalFunction to perform better. I agree that big picture these optimizations would not greatly affect the performance of the program but I am shocked to see just how bit of a difference there are between them. I think you have pretty well gotten to the bottom of this, thank you for your time and knowledge. – Pumices Jul 18 '17 at 20:59
  • I'm not sure I agree - I really think the compiler should do local variable elimination when generating IL code, and the two samples in my comment should run identically, independent of optimization, but they do not. – NetMage Jul 18 '17 at 23:01
  • @NetMage I don't know how the compiler could make that guarantee in general. As we saw with the `double` case we got significantly different runtime characteristics just by swapping the order of 2 opcodes. There's also other subtleties with that expression since the compiler would have to prove that it could evaluate the value of `b` before evaluating the value of `a` in `a += b` since the compiler guarantees left-to-right evaluation order unless it can prove otherwise. In this specific case we definitely can (which eliminates the `b` variable), but I can't say that would be easy in general. – Kyle Jul 18 '17 at 23:32
  • 1
    If there exists a case where `var b = f(x); a += b;` has a different meaning than `a += f(x);` then I think there is a language problem, not a compiler problem. – NetMage Jul 18 '17 at 23:57
  • @NetMage `var b = a++; a += v;` Contrived, I know, but it's easy to fit into a comment and the compiler has to be able to handle cases like that anyway. The function `f` could also have more benign side effects that don't affect the value of `a`, either but do also depend on it. And this is really the point, compiler writers would have to invest a lot of effort into analyzing all these cases and testing them just for a feature of relatively questionable value. It'd also be a source of technical debt when trying to make future changes to the language (like the ref returns we're getting now). – Kyle Jul 19 '17 at 00:06
  • I assume you meant `a += b;`? Thinking about it more, I don't think I find that case particularly compelling. You are essentially evaluating `a + a++` and is that even defined? – NetMage Jul 19 '17 at 00:53