
I have a class for a serial-memory 2D array that was initially an array of ints. Now that I need a similar array with another type, I've rewritten the class with templates; the only difference is in the type of stored objects:

template <class T>
class Serial2DArray
{
    ...
    T ** Content;
};

I have a few test functions that work with Content, for example one that zeroes all elements in the array (they're not class members; they're free functions operating on Serial2DArray<int> objects). I've noticed that it now runs 1-2% slower. All the other code in the class is untouched; the only difference is that earlier it was a regular class with int ** Content and now it's a template.

A somewhat similar question, Do c++ templates make programs slow?, has answers saying that only compilation becomes slower (and I can see why: the compiler generates a class for each instantiation it finds in the code), but here I see the program becoming slower at run time. Is there any rational explanation?

Upd: issue narrowed down a little bit here: https://stackoverflow.com/a/11058672/1200000

Upd2: as mentioned in the comments, here's the function that became slower:

#include <windows.h>
#include <mmsystem.h>
...
int Size = G_Width * G_Height * sizeof(int);
DWORD StartTime = timeGetTime();
for(int i=0; i<100; ++i)
{
    FillMemory(TestArray.Content[0], Size, 0);
}
MeasuredTime = timeGetTime() - StartTime;
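For readers without the Windows multimedia timer, the same measurement can be sketched portably. This is only a sketch under assumptions: it needs a C++11 compiler (for std::chrono), it uses std::memset in place of FillMemory, and the helper names TimeZeroFillMs and BestOfTrialsMs are hypothetical, not part of the original test. Repeating the timed loop and keeping the fastest trial is one common way to reduce run-to-run noise when the difference being chased is only a few percent.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstring>

// Hypothetical helper (not from the question): zero `bytes` bytes at `data`
// `iterations` times and return the elapsed wall-clock time in milliseconds.
inline double TimeZeroFillMs(void* data, std::size_t bytes, int iterations)
{
    assert(data != nullptr && iterations > 0);
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i)
    {
        std::memset(data, 0, bytes);
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Repeat the measurement and keep the fastest trial, i.e. the one least
// disturbed by other processes and timer jitter.
inline double BestOfTrialsMs(void* data, std::size_t bytes,
                             int iterations, int trials)
{
    assert(trials > 0);
    double best = TimeZeroFillMs(data, bytes, iterations);
    for (int t = 1; t < trials; ++t)
    {
        best = std::min(best, TimeZeroFillMs(data, bytes, iterations));
    }
    return best;
}
```

With the class from the question this would be called as BestOfTrialsMs(TestArray.Content[0], Size, 100, 5). The buffer should be read afterwards (or otherwise kept observable) so that an optimizer cannot discard the stores.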

And here's the actual class template:

#include <stdlib.h> // malloc, free, NULL

template <class T>
class Serial2DArray
{
    public:
    Serial2DArray()
    {
        Content = NULL;
        Width = 0;
        Height = 0;
    }
    Serial2DArray(int _Width, int _Height)
    {
        Initialize(_Width, _Height);
    }
    ~Serial2DArray()
    {
        Deinitialize();
    }
    T ** Content;
    int GetWidth()
    {
        return Width;
    }
    int GetHeight()
    {
        return Height;
    }
    int Initialize(int _Width, int _Height)
    {
        // creating pointers to the beginning of each line
        if((Content = (T **)malloc(_Height * sizeof(T *))) != NULL)
        {
            // allocating a single memory chunk for the whole array
            if((Content[0] = (T *)malloc(_Width * _Height * sizeof(T))) != NULL)
            {
                // setting up line pointers' values
                T * LineAddress = Content[0];
                for(int i=0; i<_Height; ++i)
                {
                    Content[i] = LineAddress; // faster than Content[i] =
                    LineAddress += _Width;    // Content[0] + i * _Width;
                }
                // everything went ok, setting Width and Height values now
                Width = _Width;
                Height = _Height;
                // success
                return 1;
            }
            else
            {
                // insufficient memory available
                // need to delete line pointers
                free(Content);
                // don't leave a dangling pointer for Deinitialize()
                Content = NULL;
                return 0;
            }
        }
        else
        {
            // insufficient memory available
            return 0;
        }
    }
    int Resize(int _Width, int _Height)
    {
        // deallocating previous array
        Deinitialize();
        // initializing a new one
        return Initialize(_Width, _Height);
    }
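    int Deinitialize()
    {
        // nothing to free if the array was never initialized
        if(Content != NULL)
        {
            // deleting the actual memory chunk of the array
            free(Content[0]);
            // deleting pointers to each line
            free(Content);
            Content = NULL;
        }
        Width = 0;
        Height = 0;
        // success
        return 1;
    }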
    private:
    int Width;
    int Height;
};
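As an aside, the same memory layout (one contiguous block plus one pointer per row) can be sketched with std::vector owning the storage, which removes the manual malloc/free bookkeeping. This is a hypothetical alternative, not the class that was measured; the name VectorBacked2DArray and its members are made up for illustration, and a C++11 compiler is assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical alternative (not the asker's class): one contiguous block of
// cells plus one pointer per row, with std::vector owning both.
template <class T>
class VectorBacked2DArray
{
public:
    VectorBacked2DArray(int width, int height)
        : Width(width), Height(height),
          Cells(static_cast<std::size_t>(width) * height),
          Rows(static_cast<std::size_t>(height))
    {
        assert(width > 0 && height > 0);
        // Set up the row pointers into the single contiguous block.
        for (int y = 0; y < Height; ++y)
            Rows[y] = &Cells[static_cast<std::size_t>(y) * Width];
    }

    // Row pointers refer to this object's own buffer, so copying is disabled.
    VectorBacked2DArray(const VectorBacked2DArray&) = delete;
    VectorBacked2DArray& operator=(const VectorBacked2DArray&) = delete;

    // Pointer to the start of row y, so a[y][x] works as with T**.
    T* operator[](int y)
    {
        assert(0 <= y && y < Height);
        return Rows[y];
    }

    // Pointer to the whole block, the counterpart of Content[0].
    T* Data() { return Cells.data(); }

    int GetWidth() const { return Width; }
    int GetHeight() const { return Height; }

private:
    int Width;
    int Height;
    std::vector<T> Cells;  // Width * Height elements, value-initialized
    std::vector<T*> Rows;  // Rows[y] points at the first cell of row y
};
```

Whether this performs the same as the malloc-based version under C++ Builder is untested; the sketch only shows that the layout does not depend on manual allocation.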

As requested, binaries size comparison.

Code with the following:

Serial2DArray<int> TestArray; 
Serial2DArray<int> ZeroArray;
  • 1 016 832 bytes.

Code with the following:

Serial2DArray TestArray; // NOT-template class with ints
Serial2DArray ZeroArray; // methods are in class declaration
  • 1 016 832 bytes

Code with the following:

Serial2DArray<int> TestArray;
Serial2DArray<int> ZeroArray;
Serial2DArray<double> AnotherArray;
Serial2DArray<double> YetAnotherArray;
  • 1 017 344 bytes
Fy Zn
  • Regarding the answers to the post you linked, from my point of view they are all plainly false. Beware the false sense of performance that C++ templates give you, and likewise of inlining everything in your code. It is a widespread misconception... Take some time to read Bruce Eckel's book "Thinking in C++", Vol. 1 or 2 (I don't remember which one tackles template code bloat), and see how he deals with the issue. Very interesting. – yves Baumes Jun 15 '12 at 21:53
  • (originally posted as a comment to your answer where you say that the real issue comes from putting all code in a header) What compiler is this? It seems that this would only give the compiler more information (and would not force any pessimization such as always inlining). – David Stone Jun 15 '12 at 22:03
  • Thanks for adding code, but it's not enough. Post a [_self-contained_](http://sscce.org/) repro. – ildjarn Jun 15 '12 at 22:04
  • @ildjarn posted full class code and code of the test function. Is that OK now? As for the compiler, Embarcadero RAD Studio 2010 C++ Builder. – Fy Zn Jun 15 '12 at 22:07
  • "*Embarcadero RAD Studio 2010 C++ Builder*" I think we've found the problem. _Not_ a great compiler... – ildjarn Jun 15 '12 at 22:11
  • @fynjyzn Could you compare the binary sizes with one class template instance and with two class template instances, please? And post the results here. – yves Baumes Jun 15 '12 at 22:11
  • Edited this comment into the main question text. – Fy Zn Jun 15 '12 at 22:17
  • @fynjyzn : I think the result you're providing here supports the point in my posts. A slightly heavier binary may eventually lead to more page faults. But then, as ildjarn suggested in his comments, try a compiler with a better template implementation, one that would squeeze template instantiations down as much as it can. – yves Baumes Jun 15 '12 at 22:21
  • @yvesBaumes well, there was additional information that I posted as an answer (and that got deleted): the NOT-template class with methods in the class declaration was also slow, while the NOT-template class with methods in a separate .cpp (and only declarations in the .h) was fast. Otherwise these 2 versions were identical (I just moved the method bodies out of/into the class declaration). So I'm starting to think that maybe it's not an issue with templates at all?.. – Fy Zn Jun 15 '12 at 22:25
  • @fynjyzn About the slow NOT-template class with definitions inside the class: from my point of view the reason could be the following. A method defined inside a class is implicitly treated by the compiler as an inlining request from the developer. Just think of accessors like the usual getters and setters: you don't need the inline keyword to request inlining. That would tie in with the code bloat issue, but it would also mean that your compiler followed the request when it should not have (...) – yves Baumes Jun 15 '12 at 22:30
  • (...) and that there was more than one template instance that could have been shared in the final binary. Well, these are just some thoughts about the issue, and you raise great points here. And you may be right, the actual reason may be elsewhere. – yves Baumes Jun 15 '12 at 22:31
  • @yvesBaumes sounds like a reason, yes. But then if I want to use templates, I have no choice other than putting methods into the class declaration (if I put them into a separate .cpp, I get link errors; explanation: http://www.parashift.com/c%2B%2B-faq-lite/templates.html#faq-35.12) – Fy Zn Jun 15 '12 at 22:34
  • @fynjyzn When using templates you're right: it is mandatory to define the methods with the class. I was talking about the NON-template class in my last comment. It would be great to test your example with another compiler and compare performance as well as the binary code layout, though that's not easy for me (I'm not really used to reading assembler). – yves Baumes Jun 15 '12 at 22:39
  • I just noticed, this is all about 1-2%. How long is the test, and how many iterations are you doing in a single run, and does that 1-2% go down if you prime things by running over and over? In other words, is it conceivable that all you've measured is the extra overhead to load one more block or map one more page at startup? – abarnert Jun 15 '12 at 23:06
  • @abarnert it's quite sustainable; running 100 iterations (2.9 vs 2.8 seconds), or 1000 (29 vs 28). Also, I measured the exact time of for() loop, not including preparations like initializing the array etc. – Fy Zn Jun 15 '12 at 23:21
  • Hold on, if it's 2.9 vs. 2.8 or 29 vs. 28, and it's completely sustainable, why does the original question say "I've noticed that now it works 1-2% slower"? If you've got error bars wide enough that it's appropriate to call 3.5% about 1-2%, then the measurement can't be very useful. – abarnert Jun 15 '12 at 23:37
  • @abarnert my mistake, sorry, rounded numbers for the answer. It's fluctuating around 2.83 vs 2.87 (more or less, +\- 0.01 in both cases), didn't realize that rounding messed up the %. – Fy Zn Jun 16 '12 at 00:09
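The comments above contrast methods defined in the class body with methods kept in a separate .cpp, and note that templates seem to force the former. One standard way around that is explicit instantiation, sketched below in a single file for brevity. The class ZeroFiller is a hypothetical stand-in, not the Serial2DArray from the question, and whether this changes the timings with C++ Builder is untested.

```cpp
#include <cassert>
#include <cstddef>

// --- What would go in the header: declarations only. ---
// Hypothetical example class, not from the question.
template <class T>
class ZeroFiller
{
public:
    void Fill(T* data, std::size_t count) const;
};

// --- What would go in the .cpp file: the member definition ... ---
template <class T>
void ZeroFiller<T>::Fill(T* data, std::size_t count) const
{
    assert(data != 0 || count == 0);
    for (std::size_t i = 0; i < count; ++i)
        data[i] = T();  // value-initialized T, i.e. 0 for arithmetic types
}

// ... followed by an explicit instantiation for every type the program uses.
// Other translation units can then link against these without seeing the
// definition above.
template class ZeroFiller<int>;
template class ZeroFiller<double>;
```

The trade-off is that every element type must be listed in the .cpp; using the template with an unlisted type from another file fails at link time.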

2 Answers

2

Yeah: random benchmark variability. Not to mention that even if the whole program is slower, that might have nothing at all to do with this specific class.

Puppy
  • This is a small program I've built to test this particular class, it only has an object of this class and 2 different functions that turn elements into 0, which haven't been modified. And this 1-2% slowdown is consistent on different machines and on many subsequent tests. – Fy Zn Jun 15 '12 at 21:37
  • Did you compile both versions with optimizations? – Eitan T Jun 15 '12 at 21:40
  • Yes, and both of them with identical settings. – Fy Zn Jun 15 '12 at 21:46
  • @fynjyzn : That's not what was asked. Both versions may have been compiled in debug mode.. – ildjarn Jun 15 '12 at 21:51
  • @ildjarn I understand; both were in Release mode. – Fy Zn Jun 15 '12 at 21:55
-2

Using templates in your container class may lead to the known issue of template code bloat. Roughly, it could lead to more page faults in your program, decreasing performance.

Why, you may ask? Because a template generates a separate class for each instantiation instead of a single one, leading to more pages in your binary (more code pages, if you prefer). That could statistically lead to more page faults, depending on your run-time execution path.

Have a look at the size of your binary with one class template instance and with two instances, which should be the heavier of the two. It will give you a sense of how much code each new instance adds.

Here is the Wikipedia article on that topic: Code bloat article. The issue could be the same when forcing the compiler to inline every function and method in your program, if your compiler even allows that. The standard tries to prevent it by making the inline keyword a "request" that the compiler need not always follow. For instance, GCC lowers your code to an intermediate language in order to evaluate whether inlining would lead to code bloat, and may discard the inline request as a result.

yves Baumes
  • The known non-issue, more like. For one, the compiler can fold template instances with identical assembler. And for two, more code only means more page faults if you jump about in it quite randomly. – Puppy Jun 15 '12 at 21:38
  • For point one, you may be right with recent compilers. But even so, that has not been the case for long. Take a look at Bruce Eckel's book "Thinking in C++". He deals with the issue there, and it can be quite tricky.. – yves Baumes Jun 15 '12 at 21:44
  • For the second point, you are absolutely right. But then the OP could have hit that particular path in his program. Unfortunate, but statistically plausible. – yves Baumes Jun 15 '12 at 21:46
  • @yves : For the first point, at _least_ 7 years -- not exactly recent. – ildjarn Jun 15 '12 at 21:51
  • @ildjarn : I don't know every compiler implementation on earth. Feel free to test on your own and report the vendor, the compiler version and the results of your tests. I would be interested in that. But my own recent tests with GCC 3.4 showed the same kind of performance issues as the OP's, with a very simple test program. – yves Baumes Jun 15 '12 at 21:58
  • @yves : It's a bit hard to test considering the OP didn't actually post any code. ;-] Maybe if they posted an [SSCCE](http://sscce.org/)... – ildjarn Jun 15 '12 at 21:59
  • @ildjarn Such a test is damn easy to implement for any C++ developer with 7 years of experience.... – yves Baumes Jun 15 '12 at 22:00
  • @yves : Not really considering the _real_ problem is probably something pathological in the OP's code. Why would I waste my time to prove something different than what the OP is dealing with in the first place? – ildjarn Jun 15 '12 at 22:01
  • @ildjarn That's something you learn quickly when you're growing as a developer: TEST IT YOURSELF, and don't rely on others' common beliefs. – yves Baumes Jun 15 '12 at 22:04
  • @yves : Ironic, since _you're_ the one spreading misinformation about code bloat. – ildjarn Jun 15 '12 at 22:05
  • @yvesBaumes: GCC 3.4 is an 8-year-old compiler, so asserting that it provides the same performance issues as the OP doesn't really answer the fact that compilers haven't had this problem for at least 7 years… – abarnert Jun 15 '12 at 22:21
  • More seriously: a program that does almost nothing but run this test over and over will probably have very different cache and VM characteristics than a real program. When people design optimizers, they generally worry about more real-world use cases than trivial benchmarks. (Except sometimes Intel.) – abarnert Jun 15 '12 at 22:23
  • @abarnert I agree with you, absolutely. – yves Baumes Jun 15 '12 at 22:25