I am trying to group multiple times to create an NxM matrix of actions over a very large data set.

I have operators (people) who can perform actions (about 10 distinct actions) in locations (about 1000 per operator, 5 million possible overall), and I want to produce a report that gives me:

for each operator
    for each action
        a total count of this type of action performed by this operator
        the top N locations where this action was performed by this operator

My input data looks like this:

{ time: 1, operator: 'John', action: 'up', location: 'a' },
{ time: 2, operator: 'Jane', action: 'down', location: 'b' },
{ time: 3, operator: 'John', action: 'down', location: 'a' },
{ time: 4, operator: 'Sean', action: 'charm', location: 'c' },
{ time: 5, operator: 'John', action: 'up', location: 'a' },
{ time: 6, operator: 'Jane', action: 'down', location: 'c' },
...

So for the first stage of the group, I do:

{ $group: {
    _id: {
        operator: '$operator',
        action: '$action',
        location: '$location'
    },
    count: {$sum: 1}
}}

to create:

{ _id: { operator: 'John', action: 'up', location: 'a' }, count: 2 },
{ _id: { operator: 'John', action: 'down', location: 'a' }, count: 1 },
{ _id: { operator: 'Jane', action: 'down', location: 'b' }, count: 1 },
{ _id: { operator: 'Jane', action: 'down', location: 'c' }, count: 1 },
{ _id: { operator: 'Sean', action: 'charm', location: 'c' }, count: 1 }
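
As a sanity check, the first-stage grouping can be reproduced without a database. This is only a minimal sketch in plain JavaScript: groupCounts is a hypothetical helper, not a MongoDB API, and it returns rows with the same { _id: {...}, count } shape that a $group stage emits.

```javascript
// Sample documents copied from the question.
const events = [
  { time: 1, operator: 'John', action: 'up', location: 'a' },
  { time: 2, operator: 'Jane', action: 'down', location: 'b' },
  { time: 3, operator: 'John', action: 'down', location: 'a' },
  { time: 4, operator: 'Sean', action: 'charm', location: 'c' },
  { time: 5, operator: 'John', action: 'up', location: 'a' },
  { time: 6, operator: 'Jane', action: 'down', location: 'c' },
];

// Hypothetical helper (not a MongoDB API): counts documents per
// (operator, action, location), like the first $group stage.
function groupCounts(docs) {
  const groups = new Map();
  for (const { operator, action, location } of docs) {
    const key = JSON.stringify([operator, action, location]);
    if (!groups.has(key)) {
      groups.set(key, { _id: { operator, action, location }, count: 0 });
    }
    groups.get(key).count += 1;
  }
  return [...groups.values()];
}

const stage1 = groupCounts(events);
console.log(JSON.stringify(stage1, null, 2));
```

The grouped fields live under _id, which is why later stages refer to them as '$_id.operator', '$_id.action' and '$_id.location'.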

Now I want to count every action performed by the operator (could be thousands) but only retain the top 5 locations where each operator performed each action. I want my final output to have records that look something like:

{ operator: 'John',
  total_actions: 10576,
  actions: {
     up: { count: 2052, most: [{a: 92}, {b: 91}, {c: 82}, {qqz: 60}, {d: 54}]},
     down: { count: 8482, most: [{loc: count}, {loc: count}...]},
     strange: { count: 39, most: [{loc: count}...]},
     charm: {count: 3, most: ...}
  }
},
{ operator: 'Jane',
  total_actions: 38223,
  actions: {...}
}

I'm not really fixated on "most" being an array, and it certainly doesn't need to be sorted.

I keep getting stuck on pushing...

I originally wrote the second stage to group on operator/action and pushed the per-location count onto an array, but I could not find a way to sort or limit an array.

{ $group: {
    _id: {
        operator: '$_id.operator',
        action: '$_id.action'
    },
    action_count: {$sum: '$count'},
    locations: {
        $push: {
            location: '$_id.location',
            count: '$count'
        }
    }
}}

Which further reduces the set down to:

{ _id: { operator: 'John', action: 'up' }, action_count: 2, locations: [{location: 'a', count: 2}] },
{ _id: { operator: 'John', action: 'down' }, action_count: 1, locations: [{location: 'a', count: 1}] },
{ _id: { operator: 'Jane', action: 'down' }, action_count: 2, locations: [{location: 'b', count: 1}, {location: 'c', count: 1}] },

My mongo-sense said that was wrong because locations is an array that could have potentially 1000s of entries in it per operator/action. Additionally, I have no operations for sorting and limiting this mess, but if I $unwind it at this stage, it seems like I've just reversed stage 2.

Question 1: What's the right way to proceed from here?

Thoughts:

So instead my next stage, counterintuitively, groups on locations: there could be a ton of locations and relatively few actions, so grouping on locations might still let me compute a total action count before sorting and limiting the locations. I just don't know how to proceed to the third stage.

{ $group: {
    _id: {
        operator: '$_id.operator',
        location: '$_id.location'
    },
    actions: {
        $push: {
            action: '$_id.action',
            count: '$count'
        }
    }
}}

Gives me:

{ _id: { operator: 'John', location: 'a' }, actions: [{action: 'up', count: 2}, {action: 'down', count: 1}] },
{ _id: { operator: 'Jane', location: 'b' }, actions: [{action: 'down', count: 1}] },
{ _id: { operator: 'Jane', location: 'c' }, actions: [{action: 'down', count: 1}] },
...

I'm not sure I should be using $push at all. Is this a case for $addToSet? My brain is melting.

What is the mongo-sane way to do NxM group reports?

Specifically, I want to count all of the actions performed by each operator but only report on the top N locations.

Paul T.

1 Answer

Wow, this was a tough one... thanks to chridam's answer (not the selected answer, but the proper answer) to "Mongodb: Select the top N rows from each group".

The correct solution is: after the first group, sort all of the entries by count, descending; then group them by operator and action, pushing each location onto a list (despite the fact the list could get huge); then, in a final group by operator, keep just the first few elements of each list with $slice (as an aggregation operator, $slice needs MongoDB 3.2 or later).
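
To make the sort, push, then slice idea concrete, here is a minimal sketch in plain JavaScript that mimics those steps on the five grouped rows from the question. It is an illustration only, not how MongoDB implements the stages: summarize is a hypothetical helper, and n = 1 is chosen just so the truncation is visible on such a small sample.

```javascript
// First-stage output for the question's sample data.
const rows = [
  { _id: { operator: 'John', action: 'up', location: 'a' }, count: 2 },
  { _id: { operator: 'John', action: 'down', location: 'a' }, count: 1 },
  { _id: { operator: 'Jane', action: 'down', location: 'b' }, count: 1 },
  { _id: { operator: 'Jane', action: 'down', location: 'c' }, count: 1 },
  { _id: { operator: 'Sean', action: 'charm', location: 'c' }, count: 1 },
];

// Hypothetical helper: per operator, total all actions and keep the
// n busiest locations for each action.
function summarize(grouped, n) {
  // Sort a copy by count, descending, like {$sort: {count: -1}}.
  const sorted = [...grouped].sort((x, y) => y.count - x.count);

  // Group by (operator, action): sum the counts and collect the
  // locations in sorted order, so the busiest ones come first.
  const byAction = new Map();
  for (const { _id, count } of sorted) {
    const { operator, action, location } = _id;
    const key = JSON.stringify([operator, action]);
    if (!byAction.has(key)) {
      byAction.set(key, { operator, action, count: 0, locations: [] });
    }
    const group = byAction.get(key);
    group.count += count;
    group.locations.push({ location, count });
  }

  // Group by operator: the total uses the full counts, while only the
  // first n locations of each list are kept.
  const byOperator = new Map();
  for (const g of byAction.values()) {
    if (!byOperator.has(g.operator)) {
      byOperator.set(g.operator, { _id: g.operator, total: 0, actions: [] });
    }
    const op = byOperator.get(g.operator);
    op.total += g.count;
    op.actions.push({
      action: g.action,
      count: g.count,
      top_locations: g.locations.slice(0, n),
    });
  }
  return [...byOperator.values()];
}

const report = summarize(rows, 1);
console.log(JSON.stringify(report, null, 2));
```

The point to notice is that the counts are summed before the lists are cut, so the totals still cover every location even though only the top n are reported.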

The final pipeline looks like this:

{ $group: {
    _id: {
        operator: '$operator',
        action: '$action',
        location: '$location',
    },
    count: {$sum: 1}
}},

// make sure we push from highest count to least
{$sort: {count: -1}},

{$group: {
    _id: {
        operator: '$_id.operator',
        action: '$_id.action'
    },
    count: { $sum: '$count' },
    locations: {
        $push: {
            location: '$_id.location',
            count: '$count'
        }
    }
}},

{$group: {
    '_id': '$_id.operator',
    'total': {'$sum': '$count'},
    'actions': {
        '$push': {
            'action': '$_id.action',
            'count': '$count',
            // keep just the top 5 locations
            'top_locations': {'$slice': ['$locations', 5]}
        }
    }
}}
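
The pipeline returns actions and top_locations as arrays, while the question sketched an object keyed by action name with {location: count} pairs. If that shape matters, one option is to reshape each result document on the client. A minimal sketch, assuming documents shaped like the pipeline output above; reshape is a hypothetical helper and the sample document is hand-built from the question's sample data:

```javascript
// Hand-built example of one pipeline result document (illustrative).
const pipelineDoc = {
  _id: 'John',
  total: 3,
  actions: [
    { action: 'up', count: 2, top_locations: [{ location: 'a', count: 2 }] },
    { action: 'down', count: 1, top_locations: [{ location: 'a', count: 1 }] },
  ],
};

// Hypothetical helper: converts one result document into the
// { operator, total_actions, actions: { <action>: { count, most } } }
// shape sketched in the question.
function reshape(doc) {
  const actions = {};
  for (const { action, count, top_locations } of doc.actions) {
    actions[action] = {
      count,
      // Each entry becomes a single-key object such as { a: 2 }.
      most: top_locations.map(({ location, count: n }) => ({ [location]: n })),
    };
  }
  return { operator: doc._id, total_actions: doc.total, actions };
}

const reshaped = reshape(pipelineDoc);
console.log(JSON.stringify(reshaped, null, 2));
```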
Paul T.