5

I am trying to sort and find the median of a string of integers that only contains 3 to 4 different integers.

The amount of numbers I am dealing with is of magnitudes of about 20 to 25 million and I am supposed to sort the vector and find the median each time a new integer is added into the vector and add the median into a separate "Total" variable which sums up all the medians each time a median is generated.

1                   Median: 1              Total: 1
1 , 2               Median: (1+2)/2 = 1    Total: 1 + 1 = 2
1 , 2 , 3           Median: 2              Total: 2 + 2 = 4
1 , 1 , 2 , 3       Median: (1+2)/2 = 1    Total: 4 + 1 = 5
1 , 1 , 1 , 2 , 3   Median: 1              Total: 5 + 1 = 6

I am trying to find a way to optimize my code further because it is just not efficient enough. (Got to process under 2s or so) Does anyone have any idea how to further speed up my code logic?

I am currently using 2 heaps, or priority queues in C++. One functioning as a max-heap and the other functioning as a min-heap.

Gotten the idea from Data structure to find median

You can use 2 heaps, that we will call Left and Right.
Left is a Max-Heap.
Right is a Min-Heap.
Insertion is done like this:

If the new element x is smaller than the root of Left then we insert x to 
Left.
Else we insert x to Right.
If after insertion Left has count of elements that is greater than 1 from 
the count of elements of Right, then we call Extract-Max on Left and insert 
it to Right.
Else if after insertion Right has count of elements that is greater than the 
count of elements of Left, then we call Extract-Min on Right and insert it 
to Left.
The median is always the root of Left.

So insertion is done in O(lg n) time and getting the median is done in O(1) 
time.

but it is just not fast enough...

deviljones
  • 117
  • 6
  • 8
    If the number of distinct elements is very small, you can just count the number of times each one appears in linear time and then compute the median from there. – Max Langhof Aug 28 '18 at 16:12
  • there is no limit to how often each number appears, as long as they total up to a max of about 25 million. But there are only 3 to 4 different numbers that will appear. – deviljones Aug 28 '18 at 16:17
  • 2
    That wasn't his point - if you know apriori what the numbers are its trivial to just count the number of times each distinct number appears in its own bucket, and then determine the median based on the number of items in all of the buckets. – Alnitak Aug 28 '18 at 16:18
  • why it is not fast enough oO In general from algorithmic point of view faster than O(ln n) and O(1) I don't think faster is possible. But if as mention special situation Max Langhof you have very small amount of different numbers. You can compute input O(1) and O(1) for finding by counting how much time number is appear. – Vage Egiazarian Aug 28 '18 at 16:25
  • @VageEgiazarian You can't go through the input in O(1) time, and the original algorithm is not O(ln N) either. – Max Langhof Aug 28 '18 at 16:27
  • @ Max Langhof O(ln N ) is insertion time for one element. Suppose you know can't have more than k different numbers. You can have a struct of x,y where x is number name and y is how much time is that number was inputed . If k << n(for example 4). You can "insert" one number O(1), can't you? – Vage Egiazarian Aug 28 '18 at 16:37
  • @Alnitak i assume you mean like having 3 or 4 separate counters for each distinct integer? – deviljones Aug 28 '18 at 16:38
  • @deviljones yes, exactly. The answer below from Max Langhof does just that, although it uses a slightly expensive `std::map` to store the per-integer counts. The minor problem with that is that it's typically `O(log n)` on the number of distinct numbers to update the map. – Alnitak Aug 28 '18 at 18:06
  • @deviljones what are the *actual* inputs? You said in the question that there might be four different values? Could it ever actually be five? Are they just `1 ... n ` ? – Alnitak Aug 30 '18 at 10:48
  • @Altnitak its okay! I solved it with The logic mentioned by Max – deviljones Sep 01 '18 at 07:46

2 Answers2

9

If you only ever have three to four distinct integers in the string, you can just keep track of how many times each one appears by traversing the string once. Adding (and removing elements) from this representation is also doable in constant time.

class MedianFinder
{
public:
  MedianFinder(const std::vector<int>& inputString)
  {
    for (int element : inputString)
      _counts[element]++; // Inserts 0 into map if element is not in there.
  }

  void addStringEntry(int entry)
  {
    _counts[entry]++;
  }

  int getMedian() const
  {
    size_t numberOfElements = 0;
    for (auto kvp : _counts)
      numberOfElements += kvp.second;

    size_t cumulativeCount = 0;
    int lastValueBeforeMedian;
    for (auto kvp : _counts)
    {
      cumulativeCount += kvp.second;
      if (cumulativeCount >= numberOfElements/2)
        lastValueBeforeMedian = kvp.first;
    }

    // TODO! Handle the case of the median being in between two buckets.
    //return ...
  }

private:
  std::map<int, size_t> _counts;
};

The trivial task of summing the medians is not shown here.

Max Langhof
  • 23,383
  • 5
  • 39
  • 72
2

I would not focus on optimizing as much as decreasing the complexity from O(n * log n) to O(n).

Your algorithm is O(n * log n) because you do n insertions each costing amortized O(log n) time.

There is a well known O(n) algorithm for median finding. I suggest using this.

Usually log n is not a big deal, but for 20 Million elements it can make your algorithm ~25 times faster.

Oh, my bad. I didn't realize there are only 3-4 different integers...

Community
  • 1
  • 1
Kostas
  • 4,061
  • 1
  • 14
  • 32
  • 5
    Although this is the best solution for the *general* case, it is still needlessly laborious given the constraints for OP's particular problem. – meowgoesthedog Aug 28 '18 at 16:26
  • 1
    Although this is the best solution if we asked median not very often and insertion is often. But in general, when insertion and asking time is not that different it would work much slower. – Vage Egiazarian Aug 28 '18 at 16:30