
I have a vector<Mat> my_vect in which each Mat is of type float with size 90x90. I load 16,000 matrices from disk into that vector. After I finish working with those matrices, I clear the vector. Here is my code for loading and clearing it:

Mat mat1(90, 90, CV_32F);
load_vector_of_matrices("filename", my_vect); // this loads 16K elements
// do something
for (size_t i = 0; i < my_vect.size(); ++i)
    correlate(mat1, my_vect.at(i));
my_vect.clear();

I'm loading all 16K elements together for the sake of efficiency.

Now my question: reading all these matrices takes 3-4 seconds, and my_vect.clear() takes approximately the same amount of time, which is a lot.

According to this answer, it should take O(n) time, which suggests that vector<Mat> does not have a trivial destructor.

Why does clearing take so much time? Does the Mat destructor overwrite every element of the matrix? Is there a way to decrease the time needed to clear the vector?

EDIT: I'm using Visual Studio 2010 and the level of optimization is Maximize Speed(/O2).

smttsp
  • Assuming naive `Mat` type: Floats are 32 bits (4 bytes). 1024x1024 is 1 MiB. 16,000 MiB*4 =~ 64 GB of matrices. This seems unwise. So what is your `Mat` type? – Yakk - Adam Nevraumont Sep 16 '14 at 13:13
  • You are right and I made a mistake, the size is not 1024, it is 90. Thanks for correcting. – smttsp Sep 16 '14 at 13:20
  • 1
    which compiler are you using? which level of optimization are you using? – Alessandro Teruzzi Sep 16 '14 at 13:28
  • I'm using Visual Studio 2010 and the level of optimization is Maximize Speed(/O2). – smttsp Sep 16 '14 at 13:33
  • 1
    You are deallocating half a gb of memory. Still, 3-4 seconds seems like a lot: have you actually recorded how long that step takes using something like `std::chrono`? What are you going to do with the `vector` afterwards? It makes little sense that reading from disk (and allocating) doesn't take more time than clearing them, unless there is some kind of delayed loading going on. ~100 MB/sec is about right for a HDD, so ~5 seconds is reasonable to load that much memory from disk, but freeing memory should be way faster. – Yakk - Adam Nevraumont Sep 16 '14 at 13:36
  • Can you include the code for Mat? – doron Sep 16 '14 at 13:38
  • @Yakk, yes it is half a GB, but I'm keeping that half GB as one file; I mean I save `16K of (90,90) matrices in one huge file`. This way I don't need 16K function calls and a seek for each small matrix; instead I'm calling this load function once and it loads all of them together. (I tested; this way is far faster). – smttsp Sep 16 '14 at 13:46
  • @doron I'm guessing OpenCV Mat, which is relatively easy to google, and would be far too large to post. – Yakk - Adam Nevraumont Sep 16 '14 at 13:46
  • @doron I'm adding the code; actually it's just correlation – smttsp Sep 16 '14 at 13:46
  • Just an idea that may be crazy: what if instead of 16K 90x90 matrices you code everything as a single matrix of (16K*90)x90? Or if you can rearrange them in a flat column vector per matrix (a single matrix of (90*90)x16K), you may take advantage of parallelization. – ChronoTrigger Sep 16 '14 at 17:14
  • I tried the second one but OpenCV doesn't let you have more than 512 channels, and for some reason reading a vector is faster than reading a matrix of 512 channels. For the first idea, I can't use the advantage of matrices, or can I? – smttsp Sep 16 '14 at 17:23
  • What if you streamed data into matrices, notified your processing code when some are ready, and once a matrix is processed you discard it, instead of doing all the loading at once and all the unloading at once? Alternatively, will you need other matrices of the same size afterwards? – Yakk - Adam Nevraumont Sep 16 '14 at 18:35
  • @Yakk, how can I do that? Also, do you say io read and processing will be in parallel or when one stops, the other one will run? If it is the second one, I don't see any benefit. – smttsp Sep 16 '14 at 20:43

3 Answers


First, a streaming loader. Provide it with a function that, given a maximum count, returns a vector of data (aka a loader<T>). It can store internal state, but it will be copied, so keep that internal state in a std::shared_ptr. I guarantee that only one copy of it will be invoked.

You are not responsible for returning all max data from your loader, but as written you must return at least 1 element. Returning more is gravy, and may reduce threading overhead.

You then call streaming_loader<T>( your_loader, count ).

It returns a std::shared_ptr< std::vector< std::future< T > > >. You can wait on these futures, but you must wait on them in order (the 2nd one is not guaranteed to be ready to be waited on until the first one has provided data).

#include <functional>
#include <future>
#include <memory>
#include <vector>

template<class T>
using loader = std::function< std::vector<T>(size_t max) >;

template<class T>
using stream_data = std::shared_ptr< std::vector< std::future<T> > >;

namespace details {
  template<class T>
  T streaming_load_some( loader<T> l, size_t start, stream_data<T> data ) {
    auto loaded = l(data->size()-start);
    // populate the stuff after start first, so they are ready:
    for( size_t i = 1; i < loaded.size(); ++i ) {
      std::promise<T> promise;
      promise.set_value( std::move(loaded[i]) );
      (*data)[start+i] = promise.get_future();
    }
    if (start+loaded.size() < data->size()) {
      // recurse:
      std::size_t new_start = start+loaded.size();
      (*data)[new_start] = std::async(
        std::launch::async,
        [l, new_start, data]{return streaming_load_some<T>( l, new_start, data );}
      );
    }
    // populate the future:
    return std::move(loaded.front());
  }
}
template<class T>
stream_data< T >
streaming_loader( loader<T> l, size_t n ) {
  auto retval = std::make_shared<std::vector< std::future<T> >>(n);
  if (retval->empty()) return retval;
  retval->front() = std::async(
    std::launch::async,
    [retval, l]()->T{ return details::streaming_load_some<T>( l, 0, retval ); }
  );
  return retval;
}

For use, you take the stream_data<T> (aka a shared pointer to vector of future data), iterate over it, and .get() each in turn. Then do your processing. If you need a block of 50 of them, call .get() on each in turn until you get to 50 -- do not skip to number 50.

Here is a completely toy loader and test harness:

#include <iostream>

struct loader_state {
    int index = 0;
};
struct test_loader {
  std::shared_ptr<loader_state> state; // current loading state stored here
  std::vector<int> operator()( std::size_t max ) const {
    std::size_t amt = max/2+1;// really, really stupid way to decide how much to load
    std::vector<int> retval;
    retval.reserve(amt);
    for (size_t i = 0; i < amt; ++i) {
      retval.push_back( -(int)(state->index + i) ); // populate the return value
    }
    state->index += amt;
    return retval;
  }
  // in real code, make this constructor do something:
  test_loader():state(std::make_shared<loader_state>()) {}
};
int main() {
  auto data = streaming_loader<int>( test_loader{}, 1024 );
  std::size_t count = 0;
  for( std::future<int>& x : *data ) {
    ++count;
    int value = x.get(); // get data

    // process.  In this case, print out 100 in blocks of 10:
    if (count * 100 / data->size() > (count-1) * 100 / data->size())
      std::cout << value << ", ";
    if (count * 10 / data->size() > (count-1) * 10 / data->size())
      std::cout << "\n";
  }
  std::cout << std::endl;
  // your code goes here
  return 0;
}

The internal state of the loader above is pretty much worthless; I only use it to demonstrate how to store some state.

You can do something similar to destroy a pile of objects without waiting for their destructors to complete. Or, you can rely on the fact that destroying your data can happen while you are working on it and waiting for the next data to load.

live example

In an industrial strength solution, you'd need to include ways to abort all this stuff, among other things. Exceptions might be one way. Also, feedback to the loader about how far behind the processing code is can be helpful (if it is at your heels, return smaller chunks -- if it is way behind, return larger chunks). In theory, that can be arranged via a back channel in loader<T>.

Now that I have played with the above for a bit, a probably better fit is:

#include <iostream>
#include <future>
#include <functional>
#include <vector>
#include <memory>

// if it returns empty, there is nothing more to load:
template<class T>
using loader = std::function< std::vector<T>() >;

template<class T>
struct next_data;

template<class T>
struct streamer {
  std::vector<T> data;
  std::unique_ptr<next_data<T>> next;
};

template<class T>
struct next_data:std::future<streamer<T>> {
    using parent = std::future<streamer<T>>;
    using parent::parent;
    next_data( parent&& o ):parent(std::move(o)){}
};

live example. It requires some infrastructure to populate that very first streamer<T>, but the code will be simpler, and the strange requirement (of knowing how much data, and only doing a .get() from the first element) goes away.

template<class T>
streamer<T> stream_step( loader<T> l ) {
  streamer<T> retval;
  retval.data = l();
  if (retval.data.empty())
    return retval;

  retval.next.reset( new next_data<T>(std::async( std::launch::async, [l](){ return stream_step(l); })));
  return retval;
}
template<class T>
streamer<T> start_stream( loader<T> l ) {
  streamer<T> retval;
  retval.next.reset( new next_data<T>(std::async( std::launch::async, [l](){ return stream_step(l); })));
  return retval;
}

A downside is that writing a ranged-based iterator becomes a bit trickier.

Here is a sample use of the second implementation:

struct counter {
  std::size_t max;
  std::size_t current = 0;
  counter( std::size_t m ):max(m) {}
  std::vector<int> operator()() {
    std::vector<int> retval;
    std::size_t do_at_most = 100;
    while( current < max && (do_at_most-->0)) {
      retval.push_back( int(current) );
      ++current;
    }
    return retval;
  }
};
int main() {
  streamer<int> s = start_stream<int>( counter(1024) );

  while(true) {
    for (int x : s.data) {
      std::cout << x << ",";
    }
    std::cout << std::endl;
    if (!s.next)
      break;
    s = std::move(s.next->get());
  }
  // your code goes here
  return 0;
}

where counter is a trivial loader (an object that reads data into a std::vector<T> in whatever sized chunks it feels like). The processing of the data is in the main code, where we just print them out in the chunks that get() delivers.

The loading happens on a different thread and will continue asynchronously whatever the main thread does. The main thread just gets delivered std::vector<T>s to do with as it will. In your case, you'd make T a Mat.

Yakk - Adam Nevraumont

The Mat objects are complex objects that hold internal memory allocations. When you clear the vector, it needs to iterate through every contained Mat instance and run its destructor, which is itself a non-trivial operation.

Also remember that free-store memory operations are non-trivial, so depending on your heap implementation, the heap may decide to merge cells, etc.

If this is a problem, you should run your clear through a profiler and find out where the bottleneck is.

doron

Be careful using optimization; it can drive the debugger crazy. What if you were to do this in a function and simply let the vector go out of scope? Since the elements are not pointers, I think this will work.

Alan
  • A stack-allocated vector (a vector allocated in a function) will automatically free its contents when it goes out of scope. In this case, you don't need to call clear(). I don't know if this will offer any speedup though... Might be worth a go. – Alan Sep 16 '14 at 13:56
  • I will try this idea tomorrow morning – smttsp Sep 16 '14 at 17:23
  • I tried it, nothing has changed in terms of running time. – smttsp Sep 17 '14 at 09:15