Optimizing an algorithm using map::count

Question

I currently have an algorithm that hashes a key and checks for it's uniqueness using map::count. How could this be optimized? I also forgot to mention that this is threaded.

int coll = 0;
map<long, bool> mymap;
#pragma omp parallel for
for (int i = 0; i < 256; i++)
  for (int j = 0; j < 256; j++)
    for (int k = 0; k < 256; k++)
    {
      string temp;
      temp = i;
      temp += j;
      temp += k;
      temp += temp;
      long myhash = hash(temp.c_str());

     if (mymap.count(myhash))
      {
        #pragma omp atomic
        coll++;
        cout << "Collision at " << i << " " << j << " " << k << endl;
      }
      else
      {
        #pragma omp critical
        mymap[myhash] = true;
      }
    }

cout << "Number of collisions: " << coll << endl;
cout << "Map size: " << mymap.size() << endl;

After much trial and error, here is the best version I could produce, generating 4294967296 keys in 82.5 seconds using 1GB of RAM.

#include <iostream>
#include <string>
#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <sys/time.h>
#include <iomanip>
#include <omp.h>
#include <vector>
#include <fstream>
#include <ios>
#include <unistd.h>
using namespace std;

class Timer 
{
private:
  timeval startTime;

public:

  void start()
  {
    gettimeofday(&startTime, NULL);
  }

  double stop()
  {
    timeval endTime;
    long seconds, useconds;
    double duration;

    gettimeofday(&endTime, NULL);

    seconds  = endTime.tv_sec  - startTime.tv_sec;
    useconds = endTime.tv_usec - startTime.tv_usec;

    duration = seconds + useconds/1000000.0;

    return duration;
  }

  static void printTime(double duration)
  {
    cout << setprecision(10) << fixed << duration << " seconds" << endl;
  }
};

static inline long hash(const char* str)
{
  return (*(long*)str)>> 0;
}  
int coll;
vector<bool> test;

void process_mem_usage(double& vm_usage, double& resident_set)
{
  using std::ios_base;
  using std::ifstream;
  using std::string;

  vm_usage     = 0.0;
  resident_set = 0.0;

  // 'file' stat seems to give the most reliable results
  //
  ifstream stat_stream("/proc/self/stat",ios_base::in);

  // dummy vars for leading entries in stat that we don't care about
  //
  string pid, comm, state, ppid, pgrp, session, tty_nr;
  string tpgid, flags, minflt, cminflt, majflt, cmajflt;
  string utime, stime, cutime, cstime, priority, nice;
  string O, itrealvalue, starttime;

  // the two fields we want
  //
  unsigned long vsize;
  long rss;

  stat_stream >> pid >> comm >> state >> ppid >> pgrp >> session >> tty_nr
          >> tpgid >> flags >> minflt >> cminflt >> majflt >> cmajflt
          >> utime >> stime >> cutime >> cstime >> priority >> nice
          >> O >> itrealvalue >> starttime >> vsize >> rss; // don't care about the rest

  stat_stream.close();

  long page_size_kb = sysconf(_SC_PAGE_SIZE) / 1024; // in case x86-64 is configured to use 2MB pages
  vm_usage     = vsize / 1024.0;
  resident_set = rss * page_size_kb;
}
Timer timer;
void signal_handlerkill(int sig)
{
  cout << "Number of collisions: " << coll << endl;
  //cout << test.size() << endl;
  double vm, rss;
  process_mem_usage(vm, rss);
  vm /= 1024.0;
  rss /= 1024.0;
  cout << "VM: " << vm << "MB" << endl;
  timer.printTime(timer.stop());
  exit(1);
}

int main()
{
  signal(SIGINT, signal_handlerkill);
  timer = Timer();
  timer.start();
  coll = 0;

  for (long i = 0; i < 4294967296+1; i++)
  {
    test.push_back(0); //Set up the vector
  }

  #pragma omp parallel for 
  for (int i = 0; i < 256; i++)
    for (int j = 0; j < 256; j++)
      for (int k = 0; k < 256; k++)
        for (int l = 0; l < 256; l++)
        {
          const char temp[4] = {i, j, k, l};
          long myhash = (*(long*)temp);

          if(test.at(myhash))
          {
            #pragma omp atomic
            coll++;
          }
          else
          {
            test[myhash].flip();
          }
        }

  cout << "Number of collisions: " << coll << endl;
  double vm, rss;
  process_mem_usage(vm, rss);
  vm /= 1024.0;
  rss /= 1024.0;
  cout << "VM: " << vm << "MB" << endl;
  timer.printTime(timer.stop());

  return 0;
}

Avoid converting to string, since your hash values are integers and less than 256, you can use a 24-bit key with each `i`,`j`,`k` value in a single byte. — Chad, May 15 '12 at 18:14
@Chad My [hashing function](http://stackoverflow.com/questions/628790/have-a-good-hash-function-for-a-c-hash-table) takes type const char*, and using the method I have above has no collisions over 256^3 keys. — Drise, May 15 '12 at 18:18
Maybe you could use.. another hash function? Shifting integers is far more efficient than concatenating strings, which implies dynamic memory allocation, possibly reallocation, etc. — mfontanini, May 15 '12 at 18:27
@Drise, Sure, but the way you are constructing the key, no collisions over `256 ^ 3` is exactly what is expected (`2^24`), which is exactly the same as you would get using a `24`bit key, without doing a string conversion. — Chad, May 15 '12 at 18:57
@Chad Sorry if I'm being thick, but could you demonstrate how? — Drise, May 15 '12 at 19:01
@Drise, see my answer below. The string conversion will likely be extremely expensive compared to what I've shown. My answer extrapolates that you appear to be only concerned with `6` character strings. — Chad, May 15 '12 at 19:44
That's not a very effective string hash. It's just casting it to an unsigned long. This particular input will show a good distribution, but using it with actual text will likely not (for instance, generate the hash for `Test`, `Testing`, `Testing this now`). — Chad, May 16 '12 at 15:33
@Chad I'm not concerned with string hashing, although this will be nice to keep in mind if I do. I have 3 integral keys, x, y, and z, if you wish, and I need to relate a pointer to those 3 keys. This is intended for production software, so I had to be sure that for extreme cases that there were no collisions. Typical keys are going to be no greater than 10, but I wanted to see how far I could push this. And I'm aware that I'm simply casting to a long, however, I needed a way to shrink the keys to have enough memory to hold all of them. — Drise, May 16 '12 at 15:59

score 4 · Accepted Answer · answered May 15 '12 at 18:14

4

In terms of space, you could use a set instead of a map, since the bool value is useless.

Also, if you're using C++11, an unordered_set will probably give better performance.

Also,

temp = i;
temp += j;
temp += k;
temp += temp;

probably has a bigger overhead than using a stringstream or even char arrays.

answered May 15 '12 at 18:14

Luchian Grigore

253,575
64
457
625

Good point about the concatenation. I'd be tempted to go old-school and use `sprintf`. – smocking May 15 '12 at 18:30
@smocking or even a table or hash, since the values are limited. – Luchian Grigore May 15 '12 at 18:35
Since the values are fixed (`<256`), then it is trivial to pack these into a `24`bit key as I suggested in the comment to the OP. You get the same distribution without any string conversion. – Chad May 15 '12 at 18:58
@LuchianGrigore How would you implement `set` instead of `map`? – Drise May 15 '12 at 19:05
@Drise you use find to see if the key is already there, and insert to insert a new key. – Luchian Grigore May 15 '12 at 19:14
@LuchianGrigore Implementing set::count and set::insert shaved off 3 seconds and about 1GB (halving the RAM usage if I recall correctly). – Drise May 15 '12 at 19:26

score 3 · Answer 2 · answered May 15 '12 at 18:14

3

Use insert instead of operator[]. The insert function returns a pair. The second value indicates, if the value was actually inserted, i.e. you can rewrite your code as follows:

if (!mymap.insert(std::make_pair(myhash, true)).second) {
    coll++;
    cout << "Collision at " << i << " " << j << " " << k << endl;
}

answered May 15 '12 at 18:14

nosid

48,932
13
112
139

This seems to have added 10 seconds? – Drise May 15 '12 at 18:38

score 2 · Answer 3 · edited May 23 '17 at 12:20

Well, I answered this here: https://stackoverflow.com/a/10606381/389833, and it went something like this:

    int coll = 0;
typedef map<long, bool> MY_MAP_TYPE;
MY_MAP_TYPE mymap;
string temp;
long myhash;
for (int i = 0; i < 256; i++)
    for (int j = 0; j < 256; j++)
        for (int k = 0; k < 256; k++)
        {
            temp = i;
            temp += j;
            temp += k;
            temp += temp;
            myhash = hash(temp.c_str());
            if( mymap.insert( MY_MAP_TYPE::value_type( myhash, true ) ).second == false)
            {
                coll++;
                cout << "Collision at " << i << " " << j << " " << k << endl;
            }
        }

This, along with another answer, seems to add 10 seconds. – Drise May 15 '12 at 19:06 — Drise, May 15 '12 at 19:06

score 1 · Answer 4 · answered May 15 '12 at 18:26

1

Depending on the size of your hash you could trade space for CPU time and just use a bool vector rather than a map for constant-time lookup. If the range is 0 - 256³ (the number of unique values here), it should only take about 2 MB since STL vectors in many implementations will internally compact bool vectors to bits. Of course this won't be efficient (or perhaps work at all) if your hash function can return very large values like 2³² or even 2⁶⁴.

answered May 15 '12 at 18:26

smocking

3,689
18
22

Space is less of a problem compared to CPU time. I threaded it, thanks to @Tudor, to solve this problem, as I would like to add a 4th key, giving 256^4 values, and **giant** lack of RAM issue. – Drise May 15 '12 at 18:45
You actually were right on, smocking. I would give you another 5+ if I could. – Drise May 16 '12 at 04:01

Chad · Answer 5 · 2012-05-15T23:48:10.840

If you are only concerned with 6 character strings, then you can easily optimize the loop(s) you are generating by the following:

for (int i = 0; i < 256; i++)
  for (int j = 0; j < 256; j++)
    for (int k = 0; k < 256; k++)
    {
/*
      string temp;
      temp = i;
      temp += j;
      temp += k;
      temp += temp;
      myhash = hash(temp.c_str());
*/  
      // effectively, the same as above
      const char temp[7] = {i, j, k, i, j, k, '\0'};
      myhash = hash(temp);
    }

The above combined with the insert as suggested also, should provide a nice performance boost.

EDIT:

So, you comment below on this version being "slower" makes me really question:

How you are profiling
The implementation of your hash function

These are questionable, because running this code on my machine (ignore the 3.3GHz magic number for now, as that is the speed of my CPU):

#include <iostream>
#include <vector>
#include <boost/functional/hash.hpp>
#include <x86intrin.h>

using namespace std;

uint64_t f(std::vector<uint64_t>& values)
{
    boost::hash<std::string> hasher;

    uint64_t start = __rdtsc();
    int z = 0;

    for (int i = 0; i < 256; i++)
    {
      for (int j = 0; j < 256; j++)
      {
        for (int k = 0; k < 256; k++)
        {
          string temp;
          temp = i;
          temp += j;
          temp += k;
          temp += temp;

          values[z++] = hasher(temp);
        }
      }
    }

    return (__rdtsc()) - start;
}

uint64_t g(std::vector<uint64_t>& values)
{
    boost::hash<std::string> hasher;

    uint64_t start = __rdtsc();
    int z = 0;

    for (int i = 0; i < 256; i++)
    {
      for (int j = 0; j < 256; j++)
      {
        for (int k = 0; k < 256; k++)
        {
          const char temp[7] = {i, j, k, i, j, k, '\0'};
          values[z++] = hasher(std::string(temp, 6));
        }
      }
    }

    return (__rdtsc()) - start;
}

static const double freq = 3300000000.0;
static const int elements = 1024 * 1024 * 16;

int main()
{
    std::vector<uint64_t> values_f(elements);
    std::vector<uint64_t> values_g(elements);

    uint64_t delta_f = f(values_f);
    uint64_t delta_g = g(values_g);

    cout << "F: " << (delta_f * 1000.0) / freq << "ms \n";
    cout << "G: " << (delta_g * 1000.0) / freq << "ms \n";

    for(int x = 0; x < elements; ++x)
    {
        if(values_f[x] != values_g[x])
        {
            cout << "Error: Expected "
                 << values_f[x] << " received "
                 << values_g[x] << "!\n";
        }
    }

    return 0;
}

Gives this output:

F: 3297.17ms 
G: 736.444ms

Showing that the version that constructs the std::string (which wouldn't even technically be necessary) performs much better than the version that does the concatenation. The difference in my case being the use of boost::hash (and obviously using a std::vector instead of a std::map or std::set, but that doesn't bias the test towards either of the results.

The hash function I borrowed just so happened to work for what I needed. I am not at all concerned with 6 characters, however I could not devise a way to go from my integer keys to the const char* of the example I found. — Drise, May 15 '12 at 19:47
With: Number of collisions: 0 Map size: 16777216 VM: 841.797MB 22.6494870000 seconds Without: Number of collisions: 0 Map size: 16777216 VM: 1053.76MB 21.4004230000 seconds — Drise, May 15 '12 at 20:12
@Drise, this just raises questions about your profiling code and the hash implementation. See my edit above, for a profiled (with `-O2` optimizations) build, and the result is along the lines of what I expected. — Chad, May 15 '12 at 23:49
[Here](http://pastebin.com/TGKzEjnS) is my complete code. I know its horrible in terms of programming etiquette, but this is a side project mostly for my own entertainment. As shown there, my stats are: Number of collisions: 0 Map size: 16777216 VM: 841.801MB 9.6751800000 seconds — Drise, May 16 '12 at 00:22
When I use the `std::string`, I get: Number of collisions: 0 Map size: 16777216 VM: 1067.36MB 8.6875100000 seconds . Notice the increase of 220MB, but the decrease of 1 second. — Drise, May 16 '12 at 00:29
What do you get if you run my code above? Change `frequency` based on your CPU... — Chad, May 16 '12 at 02:28
I don't have boost and can't use it, otherwise I would. I found a major bottleneck in the program when using set::count. I cooked up a vector implementation that I'll share when I get into work tomorrow. Number of collisions: 0 Map size: 4294967296 VM: ~1080MB ~82 seconds. And yes, that is 52377650 keys generated and checked for uniqueness per second. — Drise, May 16 '12 at 04:00

Optimizing an algorithm using map::count

5 Answers5

Linked